Spark - How to do summary statistics on SchemaRDD? -
I want to calculate summary statistics on the log count per user. The RDD is built as follows:
val fileRdd = sc.textFile("s3n://<bucket>/project/20141215/log_type1/log_type1.*.gz")
val jsonRdd = sqlContext.jsonRDD(fileRdd)
jsonRdd.registerTempTable("log_type1")
val result = sqlContext.sql("SELECT user_id, COUNT(*) AS the_count FROM log_type1 GROUP BY user_id ORDER BY the_count DESC")
How can I apply the statistics functionality provided by Spark MLlib to result? Since the log count of each user is important, I'd like a summary of the following form:
mean: 3.245 (user-id-abcdef)
min: 1 (user-id-mmmnnnkkk)
median: 15 (user-id-xyzrpg)
max: 950 (user-id-123456789)
How can I do this? It looks like there is no mapRDD in Spark's API.
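For illustration, here is a minimal sketch of one possible approach, assuming result is the SchemaRDD from the query above (Spark 1.1+ with MLlib). The idea is to pull the counts out of the rows with a plain map, feed them to MLlib's Statistics.colStats for mean/min/max, and track the extreme user ids separately, since colStats only reports values, not which row produced them. The variable names (counts, sorted, etc.) are my own, and the median step is an ordinary sort-and-index, not an MLlib call:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pull the per-user counts out of the SchemaRDD; each Row holds
// (user_id: String, the_count: Long) from the query above.
val counts = result.map(row => (row.getString(0), row.getLong(1))).cache()

// MLlib summary statistics (mean, min, max, variance, ...) over the counts,
// wrapped as single-element Vectors as colStats expects.
val summary = Statistics.colStats(
  counts.map { case (_, c) => Vectors.dense(c.toDouble) })
println("mean: " + summary.mean(0))
println("min:  " + summary.min(0))
println("max:  " + summary.max(0))

// colStats does not say which user produced the extremes, so find those
// separately by ordering the (user_id, count) pairs on the count.
val byCount = Ordering.by((p: (String, Long)) => p._2)
val (minUser, minCount) = counts.min()(byCount)
val (maxUser, maxCount) = counts.max()(byCount)

// MLlib has no median; one way is to sort the counts and index the middle.
val sorted = counts.map(_._2).sortBy(identity).zipWithIndex().map(_.swap)
val median = sorted.lookup(sorted.count() / 2).head

Note that if you only need mean/min/max/stdev for a single numeric column, you can skip MLlib entirely: counts.map(_._2.toDouble).stats() returns a StatCounter with those fields via Spark's built-in DoubleRDDFunctions.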