R - tidyr spread function generates sparse matrix when compact vector expected
I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.
Short summary: I'm getting

A  B
1  NA
NA 2

when I wanted

A B
1 2
The xtabs data looks like this:

> xtabs(data=data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T)))
       a
p       FALSE TRUE
  FALSE     1    2
  TRUE      1    1
Now do() wants its data in data frames, so this:

> xtabs(data=data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T))) %>% as.data.frame
      p     a Freq
1 FALSE FALSE    1
2  TRUE FALSE    1
3 FALSE  TRUE    2
4  TRUE  TRUE    1
Now I want a single row of output, with the columns being the interaction of the levels. Here's what I'm looking for:

FALSE_FALSE TRUE_TRUE FALSE_TRUE TRUE_FALSE
          1         1          2          1
But instead I get:

> xtabs(data=data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T))) %>% as.data.frame %>% unite(s,a,p) %>% spread(s,Freq)
  FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1           1         NA         NA        NA
2          NA          1         NA        NA
3          NA         NA          2        NA
4          NA         NA         NA         1
I'm misunderstanding something here. I'm looking for the equivalent of this reshape2 code (using magrittr pipes for consistency):

> xtabs(data=data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T))) %>%
    as.data.frame %>%  # can be omitted. (safely??)
    melt %>%
    mutate(s=interaction(p,a),value=value) %>%
    dcast(NA~s)
Using p, a as id variables
  NA FALSE.FALSE TRUE.FALSE FALSE.TRUE TRUE.TRUE
1 NA           1          1          2         1
(Note that NA is used here because I don't have a grouping variable in this simplified example.)
Update - interestingly, adding a single grouping column seems to fix this - but why does it synthesise (presumably from row_name) a grouping column without me telling it to?

> xtabs(data=data.frame(h="foo",p=c(F,T,F,T,F),a=c(F,F,T,T,T))) %>% as.data.frame %>% unite(s,a,p) %>% spread(s,Freq)
    h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 foo           1          1          2         1
This seems to be a partial solution.
The key here is that spread doesn't aggregate the data.

Hence, if you hadn't used xtabs to aggregate first, you would be doing this:
a <- data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T), freq = 1) %>%
  unite(s,a,p)
a
##             s freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(s, freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA
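This wouldn't make sense any other way (without aggregation): with no other column to say which observations belong together, every input row becomes its own output row.

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables and the key column, this value will be substituted.

In your case, there aren't any other variables to combine with the key column. Had there been, then...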
b <- data.frame(p=c(F,T,F,T,F),a=c(F,F,T,T,T), freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(s,a,p)
b
##             s freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

> b %>% spread(s, freq)
## Error: Duplicate identifiers for rows (3, 5)
...it would fail, because it can't aggregate rows 3 and 5 (it isn't designed to).
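To see in advance which rows would collide, one option is to count the rows in each (h, s) cell before spreading. This is only a minimal sketch, assuming dplyr and tidyr are installed; the name dupes is made up for illustration and is not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as above: one row per observation, freq = 1 each.
b <- data.frame(p = c(F, T, F, T, F), a = c(F, F, T, T, T), freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(s, a, p)

# Count rows per (h, s) cell; any cell with n > 1 is a duplicate
# identifier that spread() cannot place in a single output cell.
dupes <- b %>%
  count(h, s) %>%
  filter(n > 1)

dupes
```

An empty result would mean every (h, s) combination is unique, so spread can be applied without aggregating first.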
The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, hence spread can tell which observations belong in the same row:
b %>% group_by(h, s) %>% summarize(freq = sum(freq))
## Source: local data frame [4 x 3]
## Groups: h
##
##     h           s freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>% group_by(h, s) %>% summarize(freq = sum(freq)) %>% spread(s, freq)
## Source: local data frame [2 x 5]
##
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
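If zeros are preferred to NA for combinations that never occur, the fill parameter quoted earlier can be set explicitly. The following is a sketch under the assumption that the installed tidyr still provides spread with a fill argument; the ungroup() call and the name wide are additions for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as above.
b <- data.frame(p = c(F, T, F, T, F), a = c(F, F, T, T, T), freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(s, a, p)

wide <- b %>%
  group_by(h, s) %>%
  summarize(freq = sum(freq)) %>%  # aggregate first, as in the answer
  ungroup() %>%                    # drop leftover grouping before reshaping
  spread(s, freq, fill = 0)        # absent (h, s) combinations become 0

wide
```

Whether 0 or NA is the right filler depends on whether an absent combination really means a count of zero, which is the case for frequency tables like this one.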