regex - Partial matching of two data frames on a common column (by words) in R/Python


I have two data frames read from CSV files; df1 has more rows than df2:

df1

name                         count
xxx yyyyyy bbb cccc          15
fffdd 444 ggg                20
kkbbb ccc dd 29p             5
22 cc pbc2 kmn3 b23 efgh     4
ccccccccc sss qqqq           2

df2

name
xxx yyyyyy bbb cccc
ccccccccc sss qqqq pppc
22 cc pbc2 kmn3 b23,efgh

I want partial (approximate/fuzzy) matching on the name column, matching on either the first two or the first three words. The output should look like this:

output:

name                         count
xxx yyyyyy bbb cccc          15
22 cc pbc2 kmn3 b23 efgh     4
ccccccccc sss qqqq           2

With exact matching I'm missing rows. I tried agrep in R but somehow could not get it to work, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!

In R, you can use agrep for fuzzy matching. Its max.distance parameter sets the maximum distance allowed for a match; a value below 1, such as 0.2, is interpreted as a fraction of the pattern length.

# For each name in df2, find the index of the approximately matching name in df1
df1[sapply(df2$name, agrep, df1$name, max.distance = 0.2), ]
#                       name count
# 1      xxx yyyyyy bbb cccc    15
# 5       ccccccccc sss qqqq     2
# 4 22 cc pbc2 kmn3 b23 efgh     4

The data:

df1 <- read.table(text = "name                         count
'xxx yyyyyy bbb cccc'           15
'fffdd 444 ggg '                20
'kkbbb ccc dd 29p'              5
'22 cc pbc2 kmn3 b23 efgh'      4
'ccccccccc sss qqqq'            2", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "name
'xxx yyyyyy bbb cccc'
'ccccccccc sss qqqq pppc'
'22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
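Since the question also asks about Python, here is a minimal sketch of two possible approaches using only the standard library. It is not a port of agrep: the first helper compares the first two or three words exactly (a set lookup, so it should stay fast on larger inputs), and the second uses difflib.SequenceMatcher, whose similarity ratio is a different measure from agrep's edit distance. The helper names (prefix_key, match_by_prefix, match_by_similarity) and the 0.8 threshold are assumptions chosen for illustration, and the data frames are represented as plain lists rather than pandas objects.

```python
from difflib import SequenceMatcher

# Sample data copied from the question: (name, count) rows for df1, names for df2.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def prefix_key(name, n_words=2):
    """Return the first n_words whitespace-separated words, lower-cased."""
    return tuple(name.lower().split()[:n_words])


def match_by_prefix(rows, names, n_words=2):
    """Keep the rows whose first n_words words equal those of some entry in names."""
    keys = {prefix_key(name, n_words) for name in names}
    return [row for row in rows if prefix_key(row[0], n_words) in keys]


def match_by_similarity(rows, names, threshold=0.8):
    """Keep the rows whose name is similar enough to some entry in names.

    The threshold is a hypothetical value picked for this sample data;
    this compares every pair, so it is slower than the prefix lookup.
    """
    return [
        row
        for row in rows
        if any(SequenceMatcher(None, row[0], name).ratio() >= threshold
               for name in names)
    ]


print(match_by_prefix(df1, df2, n_words=2))
print(match_by_similarity(df1, df2, threshold=0.8))
```

On the sample data both helpers keep the three rows with counts 15, 4 and 2, in df1's order. The prefix version is the closer fit to "match on the first two or three words"; the similarity version tolerates small differences anywhere in the string, at the cost of comparing every pair of names.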
