Clean the data in an efficient way in Python


I have data in the following format:

(top (s (pp-loc (in in) (np (np (dt an) (nnp oct.) (cd 19) (nn review) ) (pp (in of) (np (`` ``) (np-ttl (dt the) (nn misanthrope) ) ('' '') (pp-loc (in at) (np (np (nnp chicago) (pos 's) ) (nnp goodman) (nnp theatre) )))) (prn (-lrb- -lrb-) (`` ``) (s-hln (np-sbj (vbn revitalized) (nns classics) ) (vp (vbp take) (np (dt the) (nn stage) ) (pp-loc (in in) (np (nnp windy) (nnp city) )))) (, ,) ('' '') (np-tmp (nn leisure) (cc &) (nns arts) ) (-rrb- -rrb-) ))) (, ,) (np-sbj-2 (np (np (dt the) (nn role) ) (pp (in of) (np (nnp celimene) ))) (, ,) (vp (vbn played) (np (-none- *) ) (pp (in by) (np-lgs (nnp kim) (nnp cattrall) ))) (, ,) ) (vp (vbd was) (vp (advp-mnr (rb mistakenly) ) (vbn attributed) (np (-none- *-2) ) (pp-clr (to to) (np (nnp christina) (nnp haag) )))) (. .) ))

(top (s (np-sbj (nnp ms.) (nnp haag) ) (vp (vbz plays) (np (nnp elianti) )) (. .) ))

... (there are 7000 more)

This data is taken from a newspaper. Each new line is a new sentence (it begins with 'top'). From this data I need only the bold parts (without the parentheses) of each sentence:

(in in) (dt an) (nnp oct.) (cd 19) (nn review) (in of) (`` ``) (dt the) (nn misanthrope) ('' '') (in at) (nnp chicago) (pos 's) (nnp goodman) (nnp theatre) (-lrb- -lrb-) (`` ``) (vbn revitalized) (nns classics) (vbp take) (dt the) (nn stage) (in in) (nnp windy) (nnp city) (, ,) ('' '') (nn leisure) (cc &) (nns arts) (-rrb- -rrb-) (, ,) (dt the) (nn role) (in of) (nnp celimene) (, ,) (vbn played) (-none- *) (in by) (nnp kim) (nnp cattrall) (, ,) (vbd was) (rb mistakenly) (vbn attributed) (-none- *-2) (to to) (nnp christina) (nnp haag) (. .)

(nnp ms.) (nnp haag) (vbz plays) (nnp elianti) (. .)

I tried the following:

f = open('filename')
data = f.readlines()
f.close()

This part creates an array of tuples for each row (using regular expressions):

import re
import numpy

tag_word_train = numpy.empty((5000), dtype='object')
for i in range(0, 5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])

It takes a long time, and I couldn't tell whether it is correct.

Do you have an idea how to do this in an efficient way?

Thanks,

Hadas

nltk.tree provides functions both to read in the parse and to extract the pairs of words and part-of-speech tags that you want in the output:

>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(top (s (np-sbj (nnp ms.) (nnp haag) ) (vp (vbz plays) (np (nnp elianti) )) (. .) ))")
>>> t.pos()
[('ms.', 'nnp'), ('haag', 'nnp'), ('plays', 'vbz'), ('elianti', 'nnp'), ('.', '.')]
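If you would rather stay with regular expressions, here is a minimal sketch that needs only the standard library. It assumes every leaf has the form "(tag word)" with no whitespace or parentheses inside the tag or the word, which holds for the sample above but which I have not checked against the full file. The names LEAF, leaf_pairs and read_pairs are made up for this example. Note that the character class in the question, [\w.-], cannot match leaves such as (, ,), (cc &) or (pos 's), so those would be silently dropped. The pairs come out as (tag, word), the reverse of the order returned by t.pos().

```python
import re

# A leaf looks like "(nnp haag)" or "(, ,)": a tag and a word, neither of
# which contains whitespace or parentheses. Compiled once and reused.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaf_pairs(line):
    """Return the (tag, word) pairs of one bracketed sentence, in order."""
    return LEAF.findall(line)


def read_pairs(path):
    """Hypothetical helper: one list of (tag, word) pairs per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [leaf_pairs(line) for line in handle if line.strip()]


sample = ("(top (s (np-sbj (nnp ms.) (nnp haag) ) "
          "(vp (vbz plays) (np (nnp elianti) )) (. .) ))")
pairs = leaf_pairs(sample)
print(pairs)
```

A plain list of lists is probably enough here; a NumPy array with dtype 'object' only holds references to Python lists, so it is unlikely to gain anything for this task.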
