python - How to scrape websites filtering by their keywords in metadata?
I have written a scraper that is supposed to scrape only websites whose keywords match the given ones. This is the code:
class MySpider(CrawlSpider):
    name = 'smm'
    allowed_domains = []
    start_urls = ['http://en.wikipedia.org/wiki/social_media']
    rules = (
        Rule(SgmlLinkExtractor(deny=('statcounter.com/', 'wikipedia', 'play.google',
                                     'books.google.com', 'github.com', 'amazon', 'bit.ly',
                                     'wikimedia', 'mediawiki', 'creativecommons.org')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        # Define the keywords that must be present in the metadata to scrape the webpage
        keywords = ['social media', 'social business', 'social networking', 'social marketing',
                    'online marketing', 'social selling', 'social customer experience management',
                    'social cxm', 'social cem', 'social crm', 'google analytics', 'seo', 'sem',
                    'digital marketing', 'social media manager', 'community manager']
        # Extract the webpage keywords
        metakeywords = response.xpath('//meta[@name="keywords"]').extract()
        # Discard empty keywords
        if metakeywords != []:
            # Compare keywords and extract if one of the defined keywords is present in the metadata
            if (keywords in metakw for metakw in metakeywords):
                for link in response.xpath("//a"):
                    item = SocialMediaItem()
                    item['sourcetitle'] = link.xpath('/html/head/title').extract()
                    item['targettitle'] = link.xpath('text()').extract()
                    item['link'] = link.xpath('@href').extract()
                    item['webkw'] = metakeywords
                    outbound = str(link.xpath('@href').extract())
                    if 'http' in outbound:
                        items.append(item)
        return items
However, I think I am missing something, because it also scrapes websites that have none of the given keywords. Can you help me solve this problem? Thanks!
dani
If you want to check whether any keyword is in the metakeywords list, use any:
if any(key in metakeywords for key in keywords):
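To see why the original condition always passes, here is a minimal sketch, independent of Scrapy. A bare generator expression in parentheses is an object and is therefore always truthy, whereas any() actually consumes it. The helper matches_keywords and the sample strings are hypothetical, not part of the question's code. The sketch also assumes you want a case-insensitive substring match against the text of the keywords attribute (for example, what an XPath such as //meta[@name="keywords"]/@content might return) rather than an exact match against whole list elements.

```python
def matches_keywords(meta_values, keywords):
    """Return True if any keyword occurs inside any meta keywords string.

    Hypothetical helper: does a case-insensitive substring match.
    """
    lowered = [value.lower() for value in meta_values]
    return any(key.lower() in value for key in keywords for value in lowered)


keywords = ['social media', 'seo', 'digital marketing']

# A generator expression is an object, so it is truthy even though
# every value it would yield here is False.
always_true = bool(key in [] for key in keywords)

# Hypothetical attribute values, standing in for what the spider extracts.
on_topic = ['SEO, analytics, blogging']
off_topic = ['cooking, recipes']

print(always_true)
print(matches_keywords(on_topic, keywords))
print(matches_keywords(off_topic, keywords))
```

Note that substring matching can produce false positives for short keywords (for example 'sem' also occurs inside 'assembly'), so splitting the attribute on commas and comparing whole terms may be a better fit, depending on the pages you crawl.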