Sep 04 2017

Practical Tips for SPARC: Using Hadoop and Spark on SPARC Servers/Solaris Platform – Configuring Hadoop Single Node Environments (Part 2)

[Image: Fujitsu M10-1 front view]

In part 1 of this blog series, I provided instructions on configuring a Spark cluster environment using Hadoop YARN running on highly reliable Fujitsu M10/Fujitsu SPARC M12 servers. By taking advantage of enterprise-grade, highly reliable hardware, customers can avoid a single point of failure (SPOF) in a Hadoop environment. In part 2, I provide step-by-step instructions for configuring a Hadoop Single Node environment; the series will continue in subsequent blogs.

Creating Directories for Hadoop
Create directories for storing Hadoop data. Each directory must be created as a ZFS file system for future production use.
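
The steps below assume that the ZFS storage pool "hdpool" already exists. If it does not, a pool can be created with zpool create; the device name in this sketch is only a placeholder and should be replaced with an actual disk (or a mirrored pair for redundancy).
# zpool create hdpool c0t1d0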

Create a directory for log files of "hdfs" users.
# zfs create -p hdpool/log/hdfs
# chown hdfs:hadoop /hdpool/log/hdfs

Create a directory for log files of "yarn" users.
# zfs create -p hdpool/log/yarn
# chown yarn:hadoop /hdpool/log/yarn

Create a directory for log files of "mapred" users.
# zfs create -p hdpool/log/mapred
# chown mapred:hadoop /hdpool/log/mapred

Create a directory for the HDFS metadata.
# zfs create -p hdpool/data/1/dfs/nn
# chmod 700 /hdpool/data/1/dfs/nn
# chown -R hdfs:hadoop /hdpool/data/1/dfs/nn

Create a directory for the HDFS data blocks.
# zfs create -p hdpool/data/1/dfs/dn
# chown -R hdfs:hadoop /hdpool/data/1/dfs/dn

Create directories for "yarn" user.
# zfs create -p hdpool/data/1/yarn/local
# zfs create -p hdpool/data/1/yarn/logs
# chown -R yarn:hadoop /hdpool/data/1/yarn/local
# chown -R yarn:hadoop /hdpool/data/1/yarn/logs

Create runtime directories for the "yarn", "hdfs", and "mapred" users.
# zfs create -p hdpool/run/yarn
# chown yarn:hadoop /hdpool/run/yarn
# zfs create -p hdpool/run/hdfs
# chown hdfs:hadoop /hdpool/run/hdfs
# zfs create -p hdpool/run/mapred
# chown mapred:hadoop /hdpool/run/mapred

Create a directory for temporary data.
# zfs create -p hdpool/tmp

Confirm that all the directories have been created, as follows:
# zfs list -r hdpool

When the result is as shown below, they have been created successfully.
NAME                       USED  AVAIL  REFER  MOUNTPOINT
hdpool                    5.33M  11.7G   352K  /hdpool
hdpool/data               2.36M  11.7G   304K  /hdpool/data
hdpool/data/1             2.06M  11.7G   320K  /hdpool/data/1
hdpool/data/1/dfs          896K  11.7G   320K  /hdpool/data/1/dfs
hdpool/data/1/dfs/dn       288K  11.7G   288K  /hdpool/data/1/dfs/dn
hdpool/data/1/dfs/nn       288K  11.7G   288K  /hdpool/data/1/dfs/nn
hdpool/data/1/yarn         896K  11.7G   320K  /hdpool/data/1/yarn
hdpool/data/1/yarn/local   288K  11.7G   288K  /hdpool/data/1/yarn/local
hdpool/data/1/yarn/logs    288K  11.7G   288K  /hdpool/data/1/yarn/logs
hdpool/log                1.17M  11.7G   336K  /hdpool/log
hdpool/log/hdfs            288K  11.7G   288K  /hdpool/log/hdfs
hdpool/log/mapred          288K  11.7G   288K  /hdpool/log/mapred
hdpool/log/yarn            288K  11.7G   288K  /hdpool/log/yarn
hdpool/run                1.17M  11.7G   336K  /hdpool/run
hdpool/run/hdfs            288K  11.7G   288K  /hdpool/run/hdfs
hdpool/run/mapred          288K  11.7G   288K  /hdpool/run/mapred
hdpool/run/yarn            288K  11.7G   288K  /hdpool/run/yarn
hdpool/tmp                 288K  11.7G   288K  /hdpool/tmp

In this configuration, the log file directories are compressed by using the ZFS compression feature. As log files grow, they can exhaust the available disk space; to prevent processes from terminating abnormally because they cannot write to disk, enable compression on the log directories as follows:
# zfs set compression=lz4 hdpool/log
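
To confirm that the setting has been applied (and inherited by the child file systems), check the property as follows:
# zfs get -r compression hdpool/log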

Setting Hadoop Configuration Files
Change the current directory to the directory that contains the Hadoop configuration files (/opt/hadoop/etc/hadoop):
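# cd /opt/hadoop/etc/hadoop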

Add the followings to "hadoop-env.sh".
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export JAVA_HOME=/usr/java
export HADOOP_LOG_DIR=/hdpool/log/hdfs

Add the followings to "yarn-env.sh".
export JAVA_HOME=/usr/java
export YARN_LOG_DIR=/hdpool/log/yarn
export HADOOP_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

Add the followings to "mapred-env.sh".
export JAVA_HOME=/usr/java
export HADOOP_MAPRED_LOG_DIR=/hdpool/log/mapred
export HADOOP_MAPRED_IDENT_STRING=mapred

Edit "slaves" and write hostname managed by "DataNode".
m10spark

Edit "core-site.xml" as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://m10spark</value>
</property>
</configuration>

Edit "mapred-site.xml" as follows:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>m10spark:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>m10spark:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
</configuration>

Edit "yarn-site.xml" as follows:
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>m10spark</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///hdpool/data/1/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///hdpool/data/1/yarn/logs</value>
</property>
<property>
<name>yarn.log.aggregation.enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://hdpool/log/hadoop-yarn/apps</value>
</property>
</configuration>
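
With log aggregation enabled, the logs of finished applications are collected into the remote application log directory on HDFS and can later be retrieved with the yarn CLI. This is only a usage hint; substitute the ID of an actual application (for example, one listed by "yarn application -list -appStates FINISHED").
$ yarn logs -applicationId <application ID>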

Edit "hdfs-site.xml" as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hdpool/data/1/dfs/dn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hdpool/data/1/dfs/nn</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.supergroup</name>
<value>hadoop</value>
</property>
</configuration>
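
Before starting the daemons, it can be useful to confirm that the edited XML files are well-formed. The quick check below assumes the xmllint utility (part of libxml2, included in Solaris 11) is available and is run from the configuration directory; if xmllint is not installed, this step can be skipped.
# xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml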

Starting Hadoop Processes
WARNING: The message below may appear during the following steps. It can safely be ignored because it does not affect the Hadoop processes.
xx/xx/xx xx:xx:xx WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

First, format the Hadoop file system.
# su - hdfs -c 'hdfs namenode -format'
The message below indicates that the format completed successfully.
16/10/24 08:17:36 INFO common.Storage: Storage directory /hdpool/data/1/dfs/nn has been successfully formatted.

Start "NameNode".
# su - hdfs -c 'hadoop-daemon.sh start namenode'
Check "NameNode" status.
# /usr/java/bin/jps | grep NameNode
The message below indicates that it started successfully.
25852 NameNode

Start "ResourceManager".
# su - yarn -c 'yarn-daemon.sh start resourcemanager'
Check "ResourceManager" status.
# /usr/java/bin/jps | grep ResourceManager
The message below indicates that it started successfully.
25982 ResourceManager

Start "DataNode" and "NodeManager".
# su - hdfs -c 'hadoop-daemon.sh start datanode'
# su - yarn -c 'yarn-daemon.sh start nodemanager'
Check "DataNode" status.
# /usr/java/bin/jps | grep DataNode
The message below indicates that it started successfully.
26240 DataNode

Check "NodeManager" status.
# /usr/java/bin/jps | grep NodeManager
The message below indicates that it started successfully.
26299 NodeManager

Start "JobHistoryServer".
# su - mapred -c 'mr-jobhistory-daemon.sh start historyserver'
Check "JobHistoryServer" status.
# /usr/java/bin/jps | grep JobHistoryServer
The message below indicates that it started successfully.
26573 JobHistoryServer
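
At this point, running jps without a filter should list at least the five daemons started above: NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer (the process IDs will differ on your system).
# /usr/java/bin/jps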

Create HDFS directories.
# (Log in to the NameNode host as the "hdfs" user)
# hadoop fs -mkdir /tmp
# hadoop fs -chmod -R 1777 /tmp
# hadoop fs -mkdir /data
# hadoop fs -mkdir /data/history
# hadoop fs -chmod -R 1777 /data/history
# hadoop fs -chown yarn /data/history
# hadoop fs -mkdir /var
# hadoop fs -mkdir /var/log
# hadoop fs -mkdir /var/log/hadoop-yarn
# hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
# hadoop fs -mkdir /data/spark
# hadoop fs -chown spark /data/spark

Confirm that the HDFS directories can be accessed from all the Hadoop nodes, as follows:
# su - hdfs -c 'hadoop fs -ls -R /'
Confirm that the directories below are displayed on all nodes.
drwxr-xr-x   - hdfs  hadoop          0 2016-10-25 07:32 /data
drwxrwxrwt   - yarn  hadoop          0 2016-10-25 07:26 /data/history
drwxr-xr-x   - spark hadoop          0 2016-10-25 07:32 /data/spark
drwxrwxrwt   - hdfs  hadoop          0 2016-10-24 16:08 /tmp
drwxr-xr-x   - hdfs  hadoop          0 2016-10-25 07:29 /var
drwxr-xr-x   - hdfs  hadoop          0 2016-10-25 07:29 /var/log
drwxr-xr-x   - yarn  mapred          0 2016-10-25 07:29 /var/log/hadoop-yarn

Testing
Connect to http://<target IP address>:50070/ with a web browser. If the NameNode status page is displayed, Hadoop is working correctly.
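
If no web browser is available on the server, a quick check can also be made from the command line. This is only a sketch and assumes curl is installed; it simply verifies that the NameNode web UI responds with HTTP 200.
# curl -s -o /dev/null -w "%{http_code}\n" http://<target IP address>:50070/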

Run the Hadoop sample application. When it ends successfully, all the procedures are completed.
# su - spark
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 10 20
Number of Maps = 10
Samples per Map = 20
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
16/11/16 13:47:08 INFO client.RMProxy: Connecting to ResourceManager at m10spark/10.20.98.122:8032
16/11/16 13:47:10 INFO input.FileInputFormat: Total input paths to process : 10
16/11/16 13:47:10 INFO mapreduce.JobSubmitter: number of splits:10
16/11/16 13:47:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479271455092_0001
16/11/16 13:47:11 INFO impl.YarnClientImpl: Submitted application application_1479271455092_0001
16/11/16 13:47:11 INFO mapreduce.Job: The url to track the job: http://m10spark:8088/proxy/application_1479271455092_0001/
16/11/16 13:47:11 INFO mapreduce.Job: Running job: job_1479271455092_0001
16/11/16 13:47:25 INFO mapreduce.Job: Job job_1479271455092_0001 running in uber mode : false
16/11/16 13:47:25 INFO mapreduce.Job: map 0% reduce 0%
16/11/16 13:47:51 INFO mapreduce.Job: map 10% reduce 0%
16/11/16 13:47:53 INFO mapreduce.Job: map 20% reduce 0%
16/11/16 13:47:54 INFO mapreduce.Job: map 60% reduce 0%
16/11/16 13:48:14 INFO mapreduce.Job: map 80% reduce 0%
16/11/16 13:48:15 INFO mapreduce.Job: map 100% reduce 0%
16/11/16 13:48:17 INFO mapreduce.Job: map 100% reduce 100%
16/11/16 13:48:18 INFO mapreduce.Job: Job job_1479271455092_0001 completed successfully
16/11/16 13:48:18 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=226
        FILE: Number of bytes written=1311574
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2590
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=43
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=10
        Launched reduce tasks=1
        Data-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=241698
        Total time spent by all reduces in occupied slots (ms)=20138
        Total time spent by all map tasks (ms)=241698
        Total time spent by all reduce tasks (ms)=20138
        Total vcore-milliseconds taken by all map tasks=241698
        Total vcore-milliseconds taken by all reduce tasks=20138
        Total megabyte-milliseconds taken by all map tasks=247498752
        Total megabyte-milliseconds taken by all reduce tasks=20621312
    Map-Reduce Framework
        Map input records=10
        Map output records=20
        Map output bytes=180
        Map output materialized bytes=280
        Input split bytes=1410
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=280
        Reduce input records=20
        Reduce output records=0
        Spilled Records=40
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=2760
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=2024275968
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1180
    File Output Format Counters
        Bytes Written=97
Job Finished in 69.642 seconds
Estimated value of Pi is 3.12000000000000000000
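
As an optional additional check, the other examples bundled in the same jar can be run. The sketch below runs wordcount on a copy of a configuration file; the /tmp paths are arbitrary examples (the HDFS /tmp directory created earlier has mode 1777, so the "spark" user can write there), and the output directory must not exist before the job runs.
$ hadoop fs -mkdir -p /tmp/wordcount-in
$ hadoop fs -put /opt/hadoop/etc/hadoop/core-site.xml /tmp/wordcount-in
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /tmp/wordcount-in /tmp/wordcount-out
$ hadoop fs -cat /tmp/wordcount-out/part-r-00000 | head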

Stay tuned for the next blog, which will cover how to configure a Hadoop cluster environment on Oracle VM for SPARC. I welcome any comments and experiences you may have.

About the Author:

Shinichiro Asai

Technical Sales Support, Platform Solution Unit, FUJITSU Japan
