regex - Partial matching two data frames having a common column (by words) in R/Python

- April 15, 2014

I have two data frames read from CSV files; df1 has more rows than df2:

df1

    name                         count
    xxx yyyyyy bbb cccc          15
    fffdd 444 ggg                20
    kkbbb ccc dd 29p             5
    22 cc pbc2 kmn3 b23 efgh     4
    ccccccccc sss qqqq           2

df2

    name
    xxx yyyyyy bbb cccc
    ccccccccc sss qqqq pppc
    22 cc pbc2 kmn3 b23,efgh

I want partial (approximate/fuzzy) matching on the name column, matching on either the first two or three words. The output should look like this:

output:

    name                         count
    xxx yyyyyy bbb cccc          15
    22 cc pbc2 kmn3 b23 efgh     4
    ccccccccc sss qqqq           2

With exact matching I am missing rows. I tried agrep in R, but somehow it is not working, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!

In R, you can use agrep for fuzzy matching. Its max.distance parameter sets the maximum distance allowed for a match.

    df1[sapply(df2$name, agrep, df1$name, max.distance = 0.2), ]
    #                       name count
    # 1      xxx yyyyyy bbb cccc    15
    # 5       ccccccccc sss qqqq     2
    # 4 22 cc pbc2 kmn3 b23 efgh     4
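Since the question also asks about Python, here is a rough sketch of the same idea using only the standard library. It is not a port of agrep: difflib scores similarity with a ratio between 0 and 1 instead of an edit distance, and the cutoff of 0.8 and the helper name fuzzy_match are assumptions made for this example, not something from the answer above.

```python
from difflib import get_close_matches

# Sample data from the question: df1 as name -> count, df2 as a list of names.
df1 = {
    "xxx yyyyyy bbb cccc": 15,
    "fffdd 444 ggg": 20,
    "kkbbb ccc dd 29p": 5,
    "22 cc pbc2 kmn3 b23 efgh": 4,
    "ccccccccc sss qqqq": 2,
}
df2_names = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def fuzzy_match(names, table, cutoff=0.8):
    """For each query name, keep the single closest name in `table` (if any).

    `cutoff` is an assumed similarity threshold (difflib ratio), not an
    edit distance like agrep's max.distance.
    """
    candidates = list(table)
    rows = []
    for query in names:
        best = get_close_matches(query, candidates, n=1, cutoff=cutoff)
        if best:
            rows.append((best[0], table[best[0]]))
    return rows


matched = fuzzy_match(df2_names, df1)
for name, count in matched:
    print(name, count)
```

Like the agrep call, this compares every df2 name against every df1 name, so it can still be slow on large files. The cutoff probably needs tuning on real data.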

The data used in the R example above:

    df1 <- read.table(text = "name                         count
    'xxx yyyyyy bbb cccc'           15
    'fffdd 444 ggg '                20
    'kkbbb ccc dd 29p'              5
    '22 cc pbc2 kmn3 b23 efgh'      4
    'ccccccccc sss qqqq'           2", header = TRUE, stringsAsFactors = FALSE)

    df2 <- read.table(text = "name
    'xxx yyyyyy bbb cccc'
    'ccccccccc sss qqqq pppc'
    '22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
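If fuzzy matching is too slow, the "first two or three words" rule from the question can be tried as an exact lookup on a shortened key. The following is a minimal Python sketch under that assumption. The helper name word_key and the choice of two words are made up for this example, and it will miss rows where the first words themselves contain a typo.

```python
def word_key(name, n_words=2):
    """Build a key from the first n_words whitespace-separated words, lowercased."""
    return " ".join(name.lower().split()[:n_words])


# Sample data from the question, as (name, count) rows.
df1_rows = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg ", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2_names = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]

# Keys present in df2; set membership makes each lookup cheap.
wanted = {word_key(name) for name in df2_names}

# Keep the df1 rows whose first two words also start some df2 name.
matched = [(name, count) for name, count in df1_rows if word_key(name) in wanted]

for name, count in matched:
    print(name, count)
```

This makes one pass over each file instead of comparing every pair of names, and the rows come back in df1 order, as in the expected output in the question. Passing n_words=3 gives the three-word variant.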
