regex - Partial matching of two data frames on a common column (by words) in R/Python
I have 2 data frames read from CSV files; df1 has more rows than df2:

df1

name                        count
xxx yyyyyy bbb cccc         15
fffdd 444 ggg               20
kkbbb ccc dd 29p            5
22 cc pbc2 kmn3 b23 efgh    4
ccccccccc sss qqqq          2

df2

name
xxx yyyyyy bbb cccc
ccccccccc sss qqqq pppc
22 cc pbc2 kmn3 b23,efgh

I want partial (approximate/fuzzy) matching on the name column, matching on either the first two or three words. The output should look like this:

output:

name                        count
xxx yyyyyy bbb cccc         15
22 cc pbc2 kmn3 b23 efgh    4
ccccccccc sss qqqq          2

With exact matching I'm missing some of the rows. I tried agrep in R, but somehow it is not working, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!
In R, you can use agrep for fuzzy matching. Its max.distance parameter sets the maximum distance allowed for a match.
df1[sapply(df2$name, agrep, df1$name, max.distance = 0.2), ]
#                       name count
# 1      xxx yyyyyy bbb cccc    15
# 5       ccccccccc sss qqqq     2
# 4 22 cc pbc2 kmn3 b23 efgh     4

The data:
df1 <- read.table(text = "name count
'xxx yyyyyy bbb cccc' 15
'fffdd 444 ggg ' 20
'kkbbb ccc dd 29p' 5
'22 cc pbc2 kmn3 b23 efgh' 4
'ccccccccc sss qqqq' 2", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "name
'xxx yyyyyy bbb cccc'
'ccccccccc sss qqqq pppc'
'22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
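
For Python, which the question also asks about, here is a minimal sketch using only the standard library. It is not part of the original answer: the helper names first_words, match_by_first_words and match_by_similarity are hypothetical, the data frames are represented as plain lists for brevity, and the 0.8 cutoff is an assumed value that would likely need tuning on real data. The first helper matches on the first n words, as the question describes; the second uses difflib.get_close_matches as a rough analogue of agrep (it scores by similarity ratio, not by edit distance).

```python
import difflib
import re

# Sample data copied from the question: (name, count) pairs for df1, names for df2.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg ", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def first_words(name, n=3):
    """Return the first n words of name as a lower-cased tuple.

    Words are split on whitespace and commas (hypothetical helper).
    """
    return tuple(re.split(r"[\s,]+", name.strip().lower())[:n])


def match_by_first_words(rows, names, n=3):
    """Keep the rows whose first n words equal those of some entry in names."""
    keys = {first_words(name, n) for name in names}
    return [row for row in rows if first_words(row[0], n) in keys]


def match_by_similarity(rows, names, cutoff=0.8):
    """For each entry in names, keep the most similar row, if any reaches cutoff.

    The cutoff of 0.8 is an assumption. If several rows share the same name,
    only the first one is returned.
    """
    candidates = [row[0] for row in rows]
    matched = []
    for name in names:
        best = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
        if best:
            matched.append(rows[candidates.index(best[0])])
    return matched


prefix_rows = match_by_first_words(df1, df2, n=3)   # rows in df1 order
similar_rows = match_by_similarity(df1, df2)        # rows in df2 order
for name, count in prefix_rows:
    print(name, count)
```

With the sample data both calls keep the same three rows (counts 15, 4 and 2), only in a different order. The first-words version uses a set lookup instead of pairwise similarity scoring, which may help if speed is the main concern.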