Hadoop YARN single node performance tuning


I have a Hadoop 2.5.2 single-node installation on an Ubuntu VM with 4 cores at 3 GHz each and 4 GB of memory. The VM is not for production, only for demo and learning.

Then I wrote a very simple MapReduce application in Python (run via Hadoop Streaming) and used it to process 49 XML files. These XML files are small, a few hundred lines each, so I expected the job to be quick. But to my big surprise, it took more than 20 minutes to finish (the output of the job is correct). Below are the output metrics:

14/12/15 19:37:55 info client.rmproxy: connecting resourcemanager @ /0.0.0.0:8032
14/12/15 19:37:57 info client.rmproxy: connecting resourcemanager @ /0.0.0.0:8032
14/12/15 19:38:03 info mapred.fileinputformat: total input paths process : 49
14/12/15 19:38:06 info mapreduce.jobsubmitter: number of splits:49
14/12/15 19:38:08 info mapreduce.jobsubmitter: submitting tokens job: job_1418368500264_0005
14/12/15 19:38:10 info impl.yarnclientimpl: submitted application application_1418368500264_0005
14/12/15 19:38:10 info mapreduce.job: running job: job_1418368500264_0005
14/12/15 19:38:59 info mapreduce.job: job job_1418368500264_0005 running in uber mode : false
14/12/15 19:38:59 info mapreduce.job: map 0% reduce 0%
14/12/15 19:39:42 info mapreduce.job: map 2% reduce 0%
14/12/15 19:40:05 info mapreduce.job: map 4% reduce 0%
14/12/15 19:40:28 info mapreduce.job: map 6% reduce 0%
14/12/15 19:40:49 info mapreduce.job: map 8% reduce 0%
14/12/15 19:41:10 info mapreduce.job: map 10% reduce 0%
14/12/15 19:41:29 info mapreduce.job: map 12% reduce 0%
14/12/15 19:41:50 info mapreduce.job: map 14% reduce 0%
14/12/15 19:42:08 info mapreduce.job: map 16% reduce 0%
14/12/15 19:42:28 info mapreduce.job: map 18% reduce 0%
14/12/15 19:42:49 info mapreduce.job: map 20% reduce 0%
14/12/15 19:43:08 info mapreduce.job: map 22% reduce 0%
14/12/15 19:43:28 info mapreduce.job: map 24% reduce 0%
14/12/15 19:43:48 info mapreduce.job: map 27% reduce 0%
14/12/15 19:44:09 info mapreduce.job: map 29% reduce 0%
14/12/15 19:44:29 info mapreduce.job: map 31% reduce 0%
14/12/15 19:44:49 info mapreduce.job: map 33% reduce 0%
14/12/15 19:45:09 info mapreduce.job: map 35% reduce 0%
14/12/15 19:45:28 info mapreduce.job: map 37% reduce 0%
14/12/15 19:45:49 info mapreduce.job: map 39% reduce 0%
14/12/15 19:46:09 info mapreduce.job: map 41% reduce 0%
14/12/15 19:46:29 info mapreduce.job: map 43% reduce 0%
14/12/15 19:46:49 info mapreduce.job: map 45% reduce 0%
14/12/15 19:47:09 info mapreduce.job: map 47% reduce 0%
14/12/15 19:47:29 info mapreduce.job: map 49% reduce 0%
14/12/15 19:47:49 info mapreduce.job: map 51% reduce 0%
14/12/15 19:48:08 info mapreduce.job: map 53% reduce 0%
14/12/15 19:48:28 info mapreduce.job: map 55% reduce 0%
14/12/15 19:48:48 info mapreduce.job: map 57% reduce 0%
14/12/15 19:49:09 info mapreduce.job: map 59% reduce 0%
14/12/15 19:49:29 info mapreduce.job: map 61% reduce 0%
14/12/15 19:49:55 info mapreduce.job: map 63% reduce 0%
14/12/15 19:50:23 info mapreduce.job: map 65% reduce 0%
14/12/15 19:50:53 info mapreduce.job: map 67% reduce 0%
14/12/15 19:51:22 info mapreduce.job: map 69% reduce 0%
14/12/15 19:51:50 info mapreduce.job: map 71% reduce 0%
14/12/15 19:52:18 info mapreduce.job: map 73% reduce 0%
14/12/15 19:52:48 info mapreduce.job: map 76% reduce 0%
14/12/15 19:53:18 info mapreduce.job: map 78% reduce 0%
14/12/15 19:53:48 info mapreduce.job: map 80% reduce 0%
14/12/15 19:54:18 info mapreduce.job: map 82% reduce 0%
14/12/15 19:54:48 info mapreduce.job: map 84% reduce 0%
14/12/15 19:55:19 info mapreduce.job: map 86% reduce 0%
14/12/15 19:55:48 info mapreduce.job: map 88% reduce 0%
14/12/15 19:56:16 info mapreduce.job: map 90% reduce 0%
14/12/15 19:56:44 info mapreduce.job: map 92% reduce 0%
14/12/15 19:57:14 info mapreduce.job: map 94% reduce 0%
14/12/15 19:57:45 info mapreduce.job: map 96% reduce 0%
14/12/15 19:58:15 info mapreduce.job: map 98% reduce 0%
14/12/15 19:58:46 info mapreduce.job: map 100% reduce 0%
14/12/15 19:59:20 info mapreduce.job: map 100% reduce 100%
14/12/15 19:59:28 info mapreduce.job: job job_1418368500264_0005 completed successfully
14/12/15 19:59:30 info mapreduce.job: counters: 49
file system counters
file: number of bytes read=17856
file: number of bytes written=5086434
file: number of read operations=0
file: number of large read operations=0
file: number of write operations=0
hdfs: number of bytes read=499030
hdfs: number of bytes written=10049
hdfs: number of read operations=150
hdfs: number of large read operations=0
hdfs: number of write operations=2
job counters
launched map tasks=49
launched reduce tasks=1
data-local map tasks=49
total time spent maps in occupied slots (ms)=8854232
total time spent reduces in occupied slots (ms)=284672
total time spent map tasks (ms)=1106779
total time spent reduce tasks (ms)=35584
total vcore-seconds taken map tasks=1106779
total vcore-seconds taken reduce tasks=35584
total megabyte-seconds taken map tasks=1133341696
total megabyte-seconds taken reduce tasks=36438016
map-reduce framework
map input records=9352
map output records=296
map output bytes=17258
map output materialized bytes=18144
input split bytes=6772
combine input records=0
combine output records=0
reduce input groups=53
reduce shuffle bytes=18144
reduce input records=296
reduce output records=52
spilled records=592
shuffled maps =49
failed shuffles=0
merged map outputs=49
gc time elapsed (ms)=33590
cpu time spent (ms)=191390
physical memory (bytes) snapshot=13738057728
virtual memory (bytes) snapshot=66425016320
total committed heap usage (bytes)=10799808512
shuffle errors
bad_id=0
connection=0
io_error=0
wrong_length=0
wrong_map=0
wrong_reduce=0
file input format counters
bytes read=492258
file output format counters
bytes written=10049
14/12/15 19:59:30 info streaming.streamjob: output directory: /data_output/sb50projs_1_output

As a Hadoop newbie, this performance seems crazily unreasonable to me, and I have several questions:

  1. How do I configure Hadoop/YARN/MapReduce to make the whole environment more suitable for trial usage?

I understand Hadoop is designed for huge data sets and big files. For my trial environment the files are small and the data is limited, so which default configuration items should I change? I have already changed "dfs.blocksize" in hdfs-site.xml to a smaller value to match the small files, but it brought no big improvement. I know there are JVM configuration items in yarn-site.xml and mapred-site.xml, but I am not sure how to adjust them.
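To be concrete, the change I made looks roughly like this in hdfs-site.xml (the value below is only illustrative, not necessarily the right one):

    <!-- hdfs-site.xml: smaller block size to match the small input files -->
    <!-- 16777216 bytes = 16 MB; illustrative value only -->
    <property>
      <name>dfs.blocksize</name>
      <value>16777216</value>
    </property>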

  2. How do I read the Hadoop logs?

Under the logs folder there are separate log files for the NodeManager/ResourceManager/NameNode/DataNode. I tried to read these files to understand how the 20 minutes were spent during the job, but it's not easy for a newbie like me. I wonder whether there is a tool/UI that could help me analyze the logs.

  3. Basic performance tuning tools

Actually I have googled around for this question and got a bunch of names: Ganglia/Nagios/Vaidya/Ambari. I want to know which tool is best for analysing an issue like "why did such a simple job take 20 minutes?".

  4. Big number of Hadoop processes

Even when there is no job running on Hadoop, I found around 100 Hadoop processes on the VM, as shown below (I am using htop and sorting the result by memory). Is this normal for Hadoop, or is my environment configured incorrectly?

[screenshot of htop output]

  1. You don't have to change anything.

The default configuration is made for a small environment. You may change it if you grow the environment. And there are a lot of params, and fine tuning takes a lot of time.

But I admit your configuration is smaller than the usual ones used for tests.

  2. The logs you have to read aren't the service ones but the job ones. You can find them in /var/log/hadoop-yarn/containers/

If you want a better view of MR, use the web interface at http://127.0.0.1:8088/. You will see your job's progression in real time.

  3. IMO, basic tuning = use the Hadoop web interfaces. There are plenty available natively.

  4. I think you have found your problem here. This can be normal, or not.

But, to put it quickly, YARN launches MR tasks according to the available memory:

  • The available memory is set in yarn-site.xml: yarn.nodemanager.resource.memory-mb (default 8 GiB).
  • The memory per task is defined in mapred-site.xml or in the task properties: mapreduce.map.memory.mb (default 1536 MiB).

So:

  1. Change the memory available to the NodeManager (to 3 GiB, in order to leave 1 GiB for the system).
  2. Change the memory available to the Hadoop services (-Xmx in hadoop-env.sh, yarn-env.sh) so that the system plus the Hadoop services (NameNode / DataNode / ResourceManager / NodeManager) stay under 1 GiB.
  3. Change the memory for map tasks (512 MiB?). The lower it is, the more tasks can run at the same time.
  4. Change yarn.scheduler.minimum-allocation-mb to 512 in yarn-site.xml to allow mappers with less than 1 GiB of memory (see the sketch right after this list).
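As a rough sketch, points 1, 3 and 4 above could look like this in the config files (the values are just the suggestions above; adjust them to your VM):

    <!-- yarn-site.xml (sketch, values from points 1 and 4) -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>3072</value>
      <!-- 3 GiB for containers, leaving about 1 GiB for the system -->
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>512</value>
      <!-- allow containers smaller than 1 GiB -->
    </property>

    <!-- mapred-site.xml (sketch, value from point 3) -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>512</value>
      <!-- smaller map tasks, so more of them can run at the same time -->
    </property>

After editing these files you will need to restart the YARN daemons so the NodeManager and the scheduler pick up the new values.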

I hope this helps you.

