Nov 22 2017

Practical Tips for SPARC: Using Hadoop and Spark on SPARC Servers/Solaris Platform – Configuring Apache Spark™ (Part 1)


In our previous 3-part blog, we explained how to configure a Hadoop Cluster Node environment. The following entry illustrates how to properly set up Apache Spark™ as part of the intended environment. Unfortunately, presenting our case in a single piece would likely provoke many TL;DR reactions, so we opted for another mini-series. Below is part 1; parts 2 and 3 will follow over the next two weeks.

Advantages of Running Apache Spark™ on SPARC Servers/Solaris
While Apache Spark™ is well suited to running in a distributed cluster configuration, such a setup requires large amounts of memory to run efficiently. Fujitsu's SPARC Servers offer customers tangible benefits here: they are available at a relatively low price point and provide large amounts of memory even in their initial configurations. Moreover, the modular design of SPARC servers makes it possible to expand memory capacity even further.

System Configuration
Although there are several ways to configure a Spark cluster, we will use Hadoop YARN, which is the most frequently used method. To begin, a Spark cluster environment should be configured as an OVM guest domain, which is then added to the Hadoop cluster environment we wrote about in the previous blog. The version of Spark should be 2.1.0 or later. Using a prior version may cause fatal errors to occur on the SPARC platform. (Since the Spark-Python interface will cause errors in non-global zone environments, a non-global zone should not be used.) Below is an illustration of a possible setup.

 

[Figure: System configuration – the Spark™ guest domain (apl-node) added to the Hadoop cluster on physical node #3]

 

In our example, the Spark™ environment is installed on physical node #3, but other nodes should work equally well. Moreover, administrators may also set up a single node environment, as we explained at the start of this series back in early September. The role of each process in the illustration is as follows:

  • Driver Program is the main component within Spark. It acts as a master during distributed processing, and generates "Spark Context" objects that connect to the Spark environment.
  • Executor is a distributed computation process that runs a "Task," which is the smallest unit of computation. A node that runs Executor is called a "Worker Node."
  • Resource Manager is the cluster management process that assigns requests from the Driver Program to specific Worker Nodes. In case of failure, it fails over to ctrl-node2.
  • Node Manager acts as a per-node agent that cooperates with the Resource Manager and starts and controls the required number of Executors. (The spark-submit sketch after this list shows how these roles map onto command-line options.)
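
To make these roles more concrete, below is a minimal sketch of submitting the bundled SparkPi example to YARN from apl-node, run as the "spark" user. The executor count, memory sizes, and jar path are illustrative assumptions based on a default Spark 2.1.0 layout rather than values from our test system: in client mode the Driver Program runs locally on apl-node, the Resource Manager grants containers on the Worker Nodes, and the Node Managers launch the requested Executors.

$ spark-submit \
    --master yarn \
    --deploy-mode client \
    --num-executors 2 \
    --executor-memory 4g \
    --executor-cores 2 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100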

To produce results on our sample installation, we also need R, an open-source programming language and software environment specifically developed for statistical computing and graphics and maintained by the R Project. While pre-compiled versions are available (e.g. from Oracle), we will have to compile our own version from the R source code for the purposes laid out in this blog. Liblzma and libiconv, which are essential libraries for R, should also be installed together. When compiling R, a faster binary can be created by specifying the SPARC64 X optimization option.
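
As a rough sketch only, such a build could be configured along the following lines with the Developer Studio compilers. The installation prefix, library paths, and compiler flags are assumptions for illustration; in particular, the exact SPARC64 X optimization flag (shown here as -xtarget=sparc64x) should be verified against the Oracle Developer Studio documentation, and libiconv/liblzma are assumed to have been built and installed under /usr/local beforehand.

$ cd R-3.2.5
$ ./configure CC=cc CXX=CC F77=f95 \
      CFLAGS="-xO3 -xtarget=sparc64x" FFLAGS="-xO3 -xtarget=sparc64x" \
      CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" \
      --prefix=/opt/R --enable-R-shlib
$ gmake
# gmake install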

The test configuration is largely identical with the one mentioned in our previous postings:

  • Server: Fujitsu M10-1
  • OS: Solaris 11.3 SRU 10.7
  • Java: JDK 1.8.0_102
  • IDE: Oracle Developer Studio 12.5
  • Other software: Hadoop 2.7.3, Spark 2.1.0, R 3.2.5, libiconv 1.14, and liblzma 5.2.2

Configuring Virtual Machines – Guest Domain Installation
Add a guest domain to an arbitrary physical node. The guest domain name should be "apl-node". Allocate as much memory as possible.
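
As an illustration, the guest domain could be created with the Oracle VM Server for SPARC (ldm) commands below, executed on the control domain of the chosen physical node. The vCPU count, memory size, virtual switch (primary-vsw0), disk service (primary-vds0), and ZFS volume backend are placeholder assumptions that must be adapted to your environment; as noted above, assign as much memory as the node allows.

# ldm add-domain apl-node
# ldm set-vcpu 16 apl-node
# ldm set-memory 64G apl-node
# ldm add-vnet vnet0 primary-vsw0 apl-node
# ldm add-vdsdev /dev/zvol/dsk/rpool/apl-node-disk vol-apl@primary-vds0
# ldm add-vdisk vdisk0 vol-apl@primary-vds0 apl-node
# ldm bind-domain apl-node
# ldm start-domain apl-node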

Configuring Virtual Machines – Required Packages Installation
Since Spark provides a Python interface, the Python library packages should be installed. While Solaris 11.3 includes multiple Python versions, we will only use Python 2.7.9 in our sample installation. Python library modules such as numpy, which is frequently used by the Spark sample programs, should be installed from the OS repository rather than from the Internet, and the pip package manager should be used to prevent library inconsistency errors. When installing Python library modules, their related dependency packages should be installed together.

Performance libraries such as BLAS and LAPACK are also required. They were formerly distributed as components of the Sun compiler suite, but are now provided through the Solaris OS repository; they must therefore be installed with the 'pkg install' command as part of the Oracle Developer Studio installation.

# pkg install developer/java/jdk-8
# pkg install runtime/python-27
# pkg install library/python/*27
# pkg install developer-studio-utilities
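
As a quick optional check that the modules from the OS repository are visible to Python 2.7 and to pip, something along these lines can be used (numpy serves only as an example here):

# pip list | grep -i numpy
# python2.7 -c "import numpy; print(numpy.__version__)"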

Configuring Virtual Machines – Editing /etc/inet/hosts
In the next step, edit /etc/inet/hosts and add the IP addresses of all related nodes. Additionally, the "apl-node" IP address should be added to /etc/inet/hosts on the other nodes of this cluster configuration.

::1 localhost
127.0.0.1 localhost loghost
xxx.xxx.xxx.xxx apl-node1 apl-node.local
xxx.xxx.xxx.xxx ctrl-node1
xxx.xxx.xxx.xxx ctrl-node2
xxx.xxx.xxx.xxx jrnl-node
xxx.xxx.xxx.xxx data-node1
xxx.xxx.xxx.xxx data-node2
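
Once the file has been updated on every node, name resolution can be verified with, for example (substituting the host names used in your own file):

# getent hosts apl-node1
# ping apl-node1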

Configuring Virtual Machines – Adding Spark/Hadoop Users and Group
Create the Hadoop group "hadoop". All additional Hadoop-related users should have "hadoop" as their group.

# groupadd -g 200 hadoop

Add the user "spark", which will be used to run Spark jobs, and set its password.

# useradd -u 101 -m -g hadoop spark
# passwd spark
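
As a final check, the new account and its group membership can be confirmed with, for example:

# id spark
# groups spark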

This brings us to the end of today's entry. We'll continue with the part on required software installation at the beginning of next week.

Disclaimer
The information contained in this blog is for general information purposes only. While we endeavor to keep the information up-to-date and correct through testing on a practical system, we make no warranties of any kind about its completeness, accuracy, reliability, suitability or availability. Any reliance you place on such information is strictly at your own risk. The information in this blog is subject to change without notice.

Shinichiro Asai

 

About the Author:

Shinichiro Asai

Technical Sales Support, Platform Solution Unit, FUJITSU Japan
