Spark - How to do summary statistics on SchemaRDD?


I want to calculate summary statistics on the log count per user. The RDD is built as follows:

val fileRdd = sc.textFile("s3n://<bucket>/project/20141215/log_type1/log_type1.*.gz")
val jsonRdd = sqlContext.jsonRDD(fileRdd)
jsonRdd.registerTempTable("log_type1")
val result = sqlContext.sql("SELECT user_id, COUNT(*) AS the_count FROM log_type1 GROUP BY user_id ORDER BY the_count DESC")

How can I apply the statistics functionality provided by Spark MLlib to result? Since the log count of each user is important, I would like a summary of the following form:

mean: 3.245 (user-id-abcdef)
min: 1 (user-id-mmmnnnkkk)
median: 15 (user-id-xyzrpg)
max: 950 (user-id-123456789)

How can I do this? It looks like there is no mapRDD in Spark's API.
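One possible approach, sketched under the assumption that the query result has rows of the shape (user_id: String, the_count: Long): pull the (user, count) pairs out of the SchemaRDD, feed the counts to MLlib's Statistics.colStats for the mean, use reduce to find the min and max together with the user id that produced them, and sort plus zipWithIndex to locate the median. The variable names below (pairs, sorted, etc.) are illustrative, not from the original post.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Assuming `result` is the SchemaRDD produced by the SQL query above,
// with columns (user_id: String, the_count: Long).
val pairs = result.map(row => (row.getString(0), row.getLong(1))).cache()

// Basic aggregates via MLlib: wrap each count in a one-element Vector.
val summary = Statistics.colStats(pairs.map { case (_, c) => Vectors.dense(c.toDouble) })
println(s"mean: ${summary.mean(0)}")

// Min and max together with the user id, via pairwise reduce.
val (minUser, minCount) = pairs.reduce((a, b) => if (a._2 <= b._2) a else b)
val (maxUser, maxCount) = pairs.reduce((a, b) => if (a._2 >= b._2) a else b)

// Median: sort by count, attach an index, and pick the middle element.
val n = pairs.count()
val sorted = pairs.sortBy(_._2).zipWithIndex().map { case (kv, i) => (i, kv) }
val (medUser, medCount) = sorted.lookup(n / 2).head

Note that lookup works here because the index produced by zipWithIndex has been moved into the key position, so sorted is a pair RDD keyed by position.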

