Spark - How to do summary statistics on SchemaRDD? -
I want to calculate summary statistics on the log count per user. The RDD is built as follows:
val fileRdd = sc.textFile("s3n://<bucket>/project/20141215/log_type1/log_type1.*.gz")
val jsonRdd = sqlContext.jsonRDD(fileRdd)
jsonRdd.registerTempTable("log_type1")
val result = sqlContext.sql("SELECT user_id, COUNT(*) AS the_count FROM log_type1 GROUP BY user_id ORDER BY the_count DESC")
How can I apply the statistics functionality provided by Spark MLlib to result? Since the log count of each user is important, I'd like a summary of the following form:
mean: 3.245 (user-id-abcdef)
min: 1 (user-id-mmmnnnkkk)
median: 15 (user-id-xyzrpg)
max: 950 (user-id-123456789)
How can I do this? It looks like there is no mapRDD in Spark's API.
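For illustration, here is a minimal sketch of one possible approach, assuming result is the SchemaRDD from the query above (Spark 1.1+ with MLlib). The idea is to pull the counts out of the rows with a plain map, feed them to MLlib's Statistics.colStats for mean/min/max, and track the extreme user ids separately, since colStats only reports values, not which row produced them. The variable names (counts, sorted, etc.) are my own, and the median step is an ordinary sort-and-index, not an MLlib call:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pull the per-user counts out of the SchemaRDD; each Row holds
// (user_id: String, the_count: Long) from the query above.
val counts = result.map(row => (row.getString(0), row.getLong(1))).cache()

// MLlib summary statistics (mean, min, max, variance, ...) over the counts,
// wrapped as single-element Vectors as colStats expects.
val summary = Statistics.colStats(
  counts.map { case (_, c) => Vectors.dense(c.toDouble) })
println("mean: " + summary.mean(0))
println("min:  " + summary.min(0))
println("max:  " + summary.max(0))

// colStats does not say which user produced the extremes, so find those
// separately by ordering the (user_id, count) pairs on the count.
val byCount = Ordering.by((p: (String, Long)) => p._2)
val (minUser, minCount) = counts.min()(byCount)
val (maxUser, maxCount) = counts.max()(byCount)

// MLlib has no median; one way is to sort the counts and index the middle.
val sorted = counts.map(_._2).sortBy(identity).zipWithIndex().map(_.swap)
val median = sorted.lookup(sorted.count() / 2).head

Note that if you only need mean/min/max/stdev for a single numeric column, you can skip MLlib entirely: counts.map(_._2.toDouble).stats() returns a StatCounter with those fields via Spark's built-in DoubleRDDFunctions.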