Clean the data in an efficient way in Python

April 15, 2014

I have data in the following format:

(top (s (pp-loc (in in) (np (np (dt an) (nnp oct.) (cd 19) (nn review) ) (pp (in of) (np (`` ``) (np-ttl (dt the) (nn misanthrope) ) ('' '') (pp-loc (in at) (np (np (nnp chicago) (pos 's) ) (nnp goodman) (nnp theatre) )))) (prn (-lrb- -lrb-) (`` ``) (s-hln (np-sbj (vbn revitalized) (nns classics) ) (vp (vbp take) (np (dt the) (nn stage) ) (pp-loc (in in) (np (nnp windy) (nnp city) )))) (, ,) ('' '') (np-tmp (nn leisure) (cc &) (nns arts) ) (-rrb- -rrb-) ))) (, ,) (np-sbj-2 (np (np (dt the) (nn role) ) (pp (in of) (np (nnp celimene) ))) (, ,) (vp (vbn played) (np (-none- *) ) (pp (in by) (np-lgs (nnp kim) (nnp cattrall) ))) (, ,) ) (vp (vbd was) (vp (advp-mnr (rb mistakenly) ) (vbn attributed) (np (-none- *-2) ) (pp-clr (to to) (np (nnp christina) (nnp haag) )))) (. .) ))

(top (s (np-sbj (nnp ms.) (nnp haag) ) (vp (vbz plays) (np (nnp elianti) )) (. .) ))

..... (there are 7000 more..)

This data is taken from a newspaper. Each new line is a new sentence (it begins with 'top'). From this data I need the bold parts (without the parentheses) of each sentence, that is, the innermost (tag word) pairs:

(in in)(dt an) (nnp oct.) (cd 19) (nn review) (in of) (`` ``) (dt the) (nn misanthrope)   ('' '')  (in at)  (nnp chicago) (pos 's) (nnp goodman) (nnp theatre)(-lrb- -lrb-) (`` ``)     (vbn revitalized) (nns classics) (vbp take) (dt the) (nn stage)  (in in)   (nnp windy) (nnp    city) (, ,) ('' '') (nn leisure) (cc &) (nns arts) (-rrb- -rrb-)(, ,) (dt the) (nn role)(in of)  (nnp celimene) (, ,) (vbn played) (-none- *)(in by)(nnp kim) (nnp cattrall) (, ,) (vbd was)  (rb mistakenly)(vbn attributed) (-none- *-2) (to to)(nnp christina) (nnp haag) (. .)  (nnp ms.) (nnp haag) (vbz plays)(nnp elianti)(. .)

I tried the following:

with open('filename') as f:
    data = f.readlines()

This part creates an array of tuples for each row (using regular expressions):

import re
import numpy

tag_word_train = numpy.empty(5000, dtype='object')
for i in range(0, 5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])

It takes a long time, and I couldn't tell whether it is correct.

Do you have any idea how to do this in an efficient way?

Thanks,

Hadas

nltk.tree provides functions to both read in a parse and extract the pairs of words and part-of-speech tags that you want in the output:

>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(top (s (np-sbj (nnp ms.) (nnp haag) ) (vp (vbz plays) (np (nnp elianti) )) (. .) ))")
>>> t.pos()
[('ms.', 'nnp'), ('haag', 'nnp'), ('plays', 'vbz'), ('elianti', 'nnp'), ('.', '.')]
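If installing NLTK is not an option, a plain regular expression can probably do the same extraction. The sketch below is only an illustration and not part of the original answer: the names LEAF, leaf_pairs and read_pairs are made up here, and it assumes one parse per line, as described in the question. Note that the character class [\w.-]+ used in the question skips pairs such as (, ,), (cc &) or (-none- *), because those characters are not in the class. Matching "anything except whitespace and parentheses" avoids that. Unlike t.pos(), this returns pairs in (tag, word) order.

```python
import re

# An innermost "(tag word)" pair: two whitespace-separated tokens,
# neither of which contains whitespace or a parenthesis.
LEAF = re.compile(r'\(([^\s()]+)\s+([^\s()]+)\)')


def leaf_pairs(line):
    """Return the (tag, word) pairs of one bracketed parse, in order."""
    return LEAF.findall(line)


def read_pairs(path):
    """Hypothetical helper: one list of pairs per non-empty line of the file."""
    with open(path) as f:
        return [leaf_pairs(line) for line in f if line.strip()]


sample = ("(top (s (np-sbj (nnp ms.) (nnp haag) ) "
          "(vp (vbz plays) (np (nnp elianti) )) (. .) ))")
print(leaf_pairs(sample))
# [('nnp', 'ms.'), ('nnp', 'haag'), ('vbz', 'plays'), ('nnp', 'elianti'), ('.', '.')]
```

A plain list of lists is used instead of a NumPy object array, since the rows have different lengths and nothing numeric is being computed. The pattern is compiled once and the file is read line by line, which should be enough for a few thousand sentences.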
