python - How to scrape websites, filtering by the keywords in their metadata?


I have written a scraper that is supposed to scrape only websites whose keywords match the given ones. This is the code:

class myspider(CrawlSpider):
    name = 'smm'
    allowed_domains = []
    start_urls = ['http://en.wikipedia.org/wiki/social_media']
    rules = (
        Rule(SgmlLinkExtractor(deny=('statcounter.com/', 'wikipedia', 'play.google', 'books.google.com',
                                     'github.com', 'amazon', 'bit.ly', 'wikimedia', 'mediawiki',
                                     'creativecommons.org')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        # Define the keywords that must be present in the metadata for the webpage to be scraped
        keywords = ['social media', 'social business', 'social networking', 'social marketing',
                    'online marketing', 'social selling', 'social customer experience management',
                    'social cxm', 'social cem', 'social crm', 'google analytics', 'seo', 'sem',
                    'digital marketing', 'social media manager', 'community manager']
        # Extract the webpage keywords
        metakeywords = response.xpath('//meta[@name="keywords"]').extract()
        # Discard empty keywords
        if metakeywords != []:
            # Compare keywords and extract if one of the defined keywords is present in the metadata
            if (keywords in metakw for metakw in metakeywords):
                for link in response.xpath("//a"):
                    item = socialmediaitem()
                    item['sourcetitle'] = link.xpath('/html/head/title').extract()
                    item['targettitle'] = link.xpath('text()').extract()
                    item['link'] = link.xpath('@href').extract()
                    item['webkw'] = metakeywords
                    outbound = str(link.xpath('@href').extract())
                    if 'http' in outbound:
                        items.append(item)
        return items

However, I think I am missing something, because it also scrapes websites that contain none of the given keywords. Can you help me solve this problem? Thanks!

dani

If you want to check whether any of your keywords is in the metakeywords list, use any:

if any(key in metakeywords for key in keywords):
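
To illustrate why the original condition always passes and how any changes that, here is a minimal sketch that runs without Scrapy. The helper name page_matches and the sample strings are hypothetical, not part of the original code. It also assumes that metakeywords holds the text of the content attribute (for example, extracted with the XPath '//meta[@name="keywords"]/@content') rather than the whole meta tag, and that the keywords in it are comma-separated.

```python
def page_matches(meta_contents, keywords):
    """Return True if any wanted keyword is listed in any meta keywords string.

    meta_contents: list of strings, each the content attribute of a
    <meta name="keywords"> tag (assumed to be comma-separated).
    """
    wanted = [k.lower() for k in keywords]
    for content in meta_contents:
        # Split the attribute into individual keywords and normalise them
        page_keywords = [part.strip().lower() for part in content.split(',')]
        # any() consumes the generator and returns a real boolean
        if any(k in page_keywords for k in wanted):
            return True
    return False


# A bare generator expression is an object, so it is truthy even when it
# would yield nothing. This is why `if (... for ... in ...)` never filters.
always_true = bool(x for x in [])

print(page_matches(['Social Media, Marketing'], ['social media', 'seo']))
print(page_matches(['cooking, recipes'], ['social media', 'seo']))
```

The sketch compares whole keywords instead of substrings, so that 'seo' does not match an unrelated keyword that merely contains those letters. Whether exact or substring matching is appropriate depends on the pages being crawled.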
