regex - Partial matching two data frames having a common column (by words) in R/Python
I have two data frames loaded from CSV files; df1 has more rows than df2:
df1

    name                      count
    xxx yyyyyy bbb cccc       15
    fffdd 444 ggg             20
    kkbbb ccc dd 29p          5
    22 cc pbc2 kmn3 b23 efgh  4
    ccccccccc sss qqqq        2
df2

    name
    xxx yyyyyy bbb cccc
    ccccccccc sss qqqq pppc
    22 cc pbc2 kmn3 b23,efgh
I want to do partial (approximate/fuzzy) matching on the name column, matching on either the first two or the first three words. The result should look like this:
output:

    name                      count
    xxx yyyyyy bbb cccc       15
    22 cc pbc2 kmn3 b23 efgh  4
    ccccccccc sss qqqq        2
With exact matching I am missing rows. I tried agrep in R, but somehow it is not working, and fuzzy matching is quite slow. Please suggest a way to do this in R or Python. Any help is appreciated!
In R, you can use agrep for fuzzy matching. Its max.distance parameter sets the maximum distance allowed for a match.
    df1[sapply(df2$name, agrep, df1$name, max.distance = 0.2), ]
    #                       name count
    # 1      xxx yyyyyy bbb cccc    15
    # 5       ccccccccc sss qqqq     2
    # 4 22 cc pbc2 kmn3 b23 efgh     4
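Since the question also asks about Python, here is a rough sketch of a similar idea using only the standard library's difflib. It is not a port of agrep: agrep does approximate substring matching based on edit distance, while difflib.get_close_matches compares whole strings by a similarity ratio. The helper name fuzzy_filter and the cutoff of 0.8 are assumptions chosen for this sample data, and the rows are held in plain lists rather than a data frame to keep the example self-contained.

```python
from difflib import get_close_matches

# Sample data from the question, as (name, count) rows.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def fuzzy_filter(rows, queries, cutoff=0.8):
    """Keep the rows whose name is the closest match to some query.

    Hypothetical helper: cutoff is a similarity ratio in [0, 1],
    not an edit distance like agrep's max.distance.
    """
    names = [name for name, _ in rows]
    matched = set()
    for query in queries:
        # Best single candidate at or above the cutoff, if any.
        best = get_close_matches(query, names, n=1, cutoff=cutoff)
        if best:
            matched.add(best[0])
    # Preserve the original row order of df1.
    return [row for row in rows if row[0] in matched]


result = fuzzy_filter(df1, df2)
for name, count in result:
    print(name, count)
```

Each query is compared against every name, so this is quadratic in the number of rows and may be slow on large files; raising the cutoff makes matching stricter.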
The data used in the R example:

    df1 <- read.table(text = "name count
    'xxx yyyyyy bbb cccc' 15
    'fffdd 444 ggg ' 20
    'kkbbb ccc dd 29p' 5
    '22 cc pbc2 kmn3 b23 efgh' 4
    'ccccccccc sss qqqq' 2", header = TRUE, stringsAsFactors = FALSE)

    df2 <- read.table(text = "name
    'xxx yyyyyy bbb cccc'
    'ccccccccc sss qqqq pppc'
    '22 cc pbc2 kmn3 b23,efgh'", header = TRUE, stringsAsFactors = FALSE)
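If fuzzy matching is too slow, the "first two or three words" idea from the question can be sketched in Python as an exact lookup on a normalized key. This is only an illustration under some assumptions: the helper names first_words_key and match_on_first_words are made up for this example, words are split on whitespace and commas (so "b23,efgh" is treated like "b23 efgh"), and comparison is case-insensitive.

```python
import re

# Sample data from the question, as (name, count) rows.
df1 = [
    ("xxx yyyyyy bbb cccc", 15),
    ("fffdd 444 ggg", 20),
    ("kkbbb ccc dd 29p", 5),
    ("22 cc pbc2 kmn3 b23 efgh", 4),
    ("ccccccccc sss qqqq", 2),
]
df2 = [
    "xxx yyyyyy bbb cccc",
    "ccccccccc sss qqqq pppc",
    "22 cc pbc2 kmn3 b23,efgh",
]


def first_words_key(name, n=3):
    """Return the first n words of name as a lowercase tuple (hypothetical helper)."""
    # Split on runs of whitespace and commas; drop empty pieces.
    words = [w for w in re.split(r"[\s,]+", name.lower()) if w]
    return tuple(words[:n])


def match_on_first_words(rows, queries, n=3):
    """Keep rows whose first n words equal the first n words of any query."""
    keys = {first_words_key(q, n) for q in queries}
    return [row for row in rows if first_words_key(row[0], n) in keys]


result = match_on_first_words(df1, df2, n=3)
for name, count in result:
    print(name, count)
```

Because the keys go into a set, each row costs one lookup instead of a comparison against every name. The trade-off is that a typo inside the first n words is no longer tolerated.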