python - XPath for sub-element's text value in lxml -
first of all, possible such thing?
i have been trying out generate xpath expression using "sub-element text values" present in webpage. trying using lxml (etree, html, getpath), elementtree modules in python. don't know how generate xpath expression value present in webpage. totally know scrapy framework in python, different.
below incomplete code..
import urllib2, re lxml import etree def wgeturl(target): try: req = urllib2.request(target) req.add_header('user-agent', 'mozilla/5.0 (windows; u; windows nt 5.1; en-gb; rv:1.9.0.3 gecko/2008092417 firefox/3.0.3') response = urllib2.urlopen(req) outtxt = response.read() response.close() except: return '' return outtxt newurl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html' # homepage dt = wgeturl(newurl) parser = etree.htmlparser() tree = etree.fromstring(dt, parser)
as per lxml documentation creating element tree manually, how can use read , parsed html data (in example variable tree
or data
) access sub-element. or more importantly, if possible sub-element text value.
let's in above example webpage, want search table "supplies , expenses" , generate xpath expression dynamically value - supplies , expenses
is there option !!! ultimate goal, achieve read webpage , generate xpath sub-element text value present in webpage.
to find elements based on part of text value:
"//*[contains(text(), 'some_value')]"
for example, if have this:
<div id="somediv"> <span>something here</span> <a href="#">click here</a> </div>
you can find sub-elements containing word "here" this:
"//div[@id='somediv']//*[contains(text(), 'here')]"
or can example find sub-div span
elements containing word "something":
"//div[@id='somediv']//span[contains(text(), 'something')]"
as parsing in lxml
:
from lxml import etree outtxt = response.read() root = etree.fromstring(outtxt) root.xpath("my_xpath_expression")
update:
to full xpath expression element, use elementtree.getpath()
method, so:
tree = etree.elementtree(root) # print xpath of # elements in 'root' e in root.iter(): print tree.getpath(e)
Comments
Post a Comment