Dec 06 2017

Combine the Power of HPC with the Flexibility of Big Data Applications: Fujitsu’s Reference Model for HPDA

Have you ever wondered what might happen once the ever-increasing amount of data stashed in today's storage systems crosses a certain threshold and becomes too big for conventional analytics? Then you're probably on the same page as the researchers and engineers who have developed the discipline of High Performance Data Analytics, or HPDA for short, in recent years. This blog discusses some of HPDA's key concepts – and explains how Fujitsu's new PRIMEFLEX for HPC solution can help customers tackle the challenge of analyzing a ceaseless stream of data when retrieving real-time information is critical to the business.

Since it was first coined in the 1990s, the term 'Big Data' has been used to describe data sets that were considered "too damn large to be handled with conventional means," as our informatics teacher in college used to put it. In essence, these data sets are defined by what Gartner and others have called the "3 Vs": volume (the sheer mass of information involved), velocity (the speed at which it is collected and transferred), and variety (different types of structured and unstructured data obtained from different sources). An early example of what happens when companies face a constantly increasing influx of information but lack the capabilities to adequately evaluate it is the results returned by pre-Google-era search engines: since ranking mechanisms were still immature, they often listed relevant and trivial sources side by side, leaving it to users to sift through the pages and separate the useful from the junk. Today, the hardware platforms and algorithms that power services like Ask.com or DuckDuckGo are much more refined (though not infallible), and it's typically much easier to find reliable, factually correct information than it was before 1998.

However, while most search engine providers and global information services seem to handle Big Data reasonably well, the challenges of continuous and seemingly uncontrolled data growth are nowadays also felt by organizations with a smaller reach. In other words, these days Big Data is pretty much everywhere, and it's only logical that companies want to put it to commercial use – even more so at a time when an already massive utilization of cloud services and a rising adoption of IoT devices promise to add even more, hitherto untapped intelligence to the mix. To meet this challenge, researchers and software engineers have developed the concept of High-Performance Data Analytics (HPDA) in recent years. HPDA essentially aims to unleash the power of HPC on typical Big Data operations and to help data centers overcome the speed and scalability issues that inevitably occur when working with 'regular' platforms. Key segments that stand to benefit the most from HPDA include:

  • E-Commerce and Retail: While not all firms in this field can be as big as Amazon or Walmart, even medium-sized online shops and local or regional chain stores often have hundreds of thousands, if not millions, of valuable customer and financial records at their disposal. HPDA could help them identify hidden patterns in the demographics, buying preferences and habits of actual and potential customers, which in turn would enable them to develop and fine-tune the complex algorithms they use for "affinity marketing."
  • Banking/Retail/E-Commerce: Similar mechanisms, such as graph and semantic analysis, allow users to identify harmful or potentially harmful patterns whenever a customer tries to place an order or trigger a payment process. Real-time anomaly and fraud detection thus become much easier to implement, helping at-risk companies avoid losses and save substantial amounts of money.
  • Healthcare: Hospitals and clinics could use HPDA to determine which therapy and/or medication is best suited for which patient – and which is a no-go due to individual dispositions. Likewise, medical researchers could use such analysis to create custom-made formulations and products that treat very specific conditions. As a result, the healthcare sector will be able to provide increasingly personalized medicine.
  • Telecom carriers, cloud service providers, and IT departments: Practically all organizations that work in these fields must adhere to contracts with strict and complicated service level agreements. Consequently, they have to solve complex problems on a permanent or near-permanent basis, e.g. when deciding how to best route calls and/or data packets so that they reach the desired recipient on time. The foreseeable growth in cloud and IoT deployments will only exacerbate the problem. HPDA capabilities could be very helpful not only in identifying potential bottlenecks and vulnerabilities, but also in defining and establishing policies and solutions that minimize or eliminate the risks of network congestion and hacker attacks.

Key Technologies in Big Data Scenarios
In most commercial environments, Big Data is handled by large arrays of conventional servers or server clusters that process incoming data in parallel. To do this effectively, IT departments typically rely on open-source tools like Apache Hadoop and Apache Spark. Both are frameworks that cover different aspects of the tasks at hand:

  • Hadoop is a Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. Input data comes from the Hadoop Distributed File System (HDFS), which allows for rapid data transfer rates between nodes. Within the context of Big Data, it serves as an infrastructure layer.
  • Spark for its part is a framework for performing general data analytics on distributed computing clusters like Hadoop. It supports in-memory computations for increased speed and thus ramps up the pace of data analytics while at the same time broadening the set of workloads Hadoop can handle.
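
To make this division of labor concrete, here is a minimal PySpark sketch that reads a hypothetical set of order records from HDFS (the infrastructure layer) and aggregates them with Spark (the analytics layer). The path and column names are illustrative placeholders, not part of any specific Fujitsu configuration.

```python
# Minimal PySpark sketch: aggregate a large set of order records stored in HDFS.
# The HDFS path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hpda-demo")
         .getOrCreate())

# Spark reads directly from HDFS; blocks are processed in parallel across nodes.
orders = spark.read.csv("hdfs:///data/orders/*.csv", header=True, inferSchema=True)

# In-memory caching is what gives Spark its speed advantage over plain MapReduce.
orders.cache()

# Aggregate revenue per region across the distributed data set.
revenue = (orders.groupBy("region")
           .agg(F.sum("amount").alias("revenue"))
           .orderBy(F.desc("revenue")))

revenue.show()
spark.stop()
```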

These great capabilities notwithstanding, both Hadoop and Spark run into their own constraints when high data velocity is combined with the requirement for on-time information.

  • Parallel file systems used in HPC, such as Lustre or Fujitsu's own FEFS, are faster and more effective at random reads of small files. Hadoop, in contrast, often reaches its limits when it comes to scaling concurrent jobs while meeting individual high-speed performance requirements, and HDFS does not support the scalable network topologies that power HPC installations.
  • Spark, on the other hand, is often described as 'heavy on memory resources,' meaning that it requires more RAM than most standard servers can offer.
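
Spark's memory appetite can at least be steered explicitly. The following sketch shows one way to size executor and driver memory via standard Spark configuration properties; the values are illustrative placeholders and would need tuning to the actual cluster.

```python
# A minimal sketch of explicit memory sizing for a Spark job; the figures
# below are placeholders, not recommendations.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.executor.memory", "64g")   # heap available to each executor
    .set("spark.driver.memory", "16g")     # heap for the driver process
    .set("spark.memory.fraction", "0.6")   # share of heap for execution and caching
)

spark = (
    SparkSession.builder
    .appName("memory-sizing-sketch")
    .config(conf=conf)
    .getOrCreate()
)
```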

HPDA was developed to obtain the best of both worlds by leveraging HPC architecture for Big Data applications where price per performance is critical. This goal served as the basis for our reference model.

Fujitsu's Solution: a Reference Model for HPDA
Fujitsu has extensive experience in creating platforms that are equally well-suited for Big Data and HPC environments. It was therefore only natural for us to expand our portfolio with a solution for HPDA scenarios. But instead of simply bundling the necessary hardware and software components into a neat package to be sold as a whole, we opted for a more flexible approach: Fujitsu's HPDA Reference Model provides configurations that reflect the requirements of the data sizes customers are dealing with. Each individual installation is essentially a co-creation, developed by Fujitsu engineers together with the IT teams on site, and helps users integrate their analytics pipeline into the HPC workflow, orchestrate data staging and smart data movement between different software components, and still execute HPC production workloads.

In this context, our HPDA Reference Model combines the smart integration and performance advantages of an HPC infrastructure with Big Data and data analytics technology. If an existing HPC infrastructure or FUJITSU Integrated System PRIMEFLEX for HPC is already running, the HPDA Reference Model can build on those resources and be scaled according to specific needs. Alternatively, a new PRIMEFLEX for HPC/HPDA infrastructure may be built from scratch. In both cases, a traditional HPC cluster is augmented with the Hadoop tools needed for Big Data and data analytics processing, enabling traditional HPC workloads to co-exist with HPDA processes. In addition, the HPC parallel file system is combined with an HDFS connector so that HPDA applications can seamlessly access data alongside HPC applications. That way, it's possible to create agile systems that allow for fast/real-time execution of both HPC and data analytics workloads, leading to optimized TCO and improved ROI.
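
Because the parallel file system is POSIX-mounted on the compute nodes, an analytics job can read the same files an HPC simulation writes. Below is a hedged PySpark sketch of that coexistence; the mount point /mnt/pfs and the dataset are hypothetical, and the exact path scheme depends on how the HDFS connector is configured in a given installation.

```python
# Minimal sketch: analyze simulation output that an HPC job wrote to the
# POSIX-mounted parallel file system (the mount point is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pfs-coexistence").getOrCreate()

# Read directly from the parallel file system via the file:// scheme;
# with an HDFS connector in place, an hdfs:// style path could map onto
# the same storage.
results = spark.read.parquet("file:///mnt/pfs/project/simulation_output")

# Summary statistics over the simulation results.
results.describe().show()

spark.stop()
```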

Integrated Solution Stack
To simplify the co-creation process, Fujitsu offers an integrated hard- and software solution stack that includes the components shown in Fig. 1.

 

Fig. 1: Overview of the solution stack and components of PRIMEFLEX for HPC/HPDA

 

The stack includes:

  • One or more Fujitsu PRIMERGY RX2530 1U dual-processor servers that act as compute nodes. Equipped with 24 DIMMs and broad support for Intel's new Xeon® Processor Scalable Family, the RX2530 can tackle a wide variety of workloads and scale to meet the most demanding processing and memory requirements.
  • Each node comes with a local set of Intel SSDs to achieve the required storage performance. These SSDs work well with both standard and parallel file systems and support the processing of structured and unstructured data. For optimal performance, the recommended ratio of SSD capacity to memory is 3:1, and of SSD capacity to memory per core 33.33:1 (a worked sizing example follows this list).
  • The parallel file system (PFS) is based on BeeGFS and, as outlined above, configured with an HDFS connector function to offer the best possible performance. Since data redundancy is not a standard feature of the HPC PFS, we have included a permanent project storage layer within the reference architecture, so that data that must be retained can be copied from the local PFS to this layer.
  • Both HPC and HPDA workloads require high-speed networking capabilities. That's why the stack includes high speed interconnects based on InfiniBand and/or Omni-Path to ensure inter-node communication is maximized and data movement to the permanent project storage occurs at the highest throughput rates.
  • Cluster monitoring and management is handled by Fujitsu HPC software. The toolset simplifies installation and ongoing maintenance of the overall stack and includes two batch job managers (PBS Pro and SLURM).
  • Finally, FUJITSU's HPC Gateway combines the simplicity of web access with agile application workflows and the clarity of active reporting to tune business processes and achieve better project outcomes. The intuitive web GUI includes direct simulation monitoring, data access and collaboration. New and occasional users as well as seasoned HPC practitioners will find the interface highly effective for preparing, launching and monitoring their work.
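
As referenced above, here is a small worked example of the SSD sizing rule from the compute-node description; the node configuration (384 GB RAM, 24 cores) is a hypothetical illustration, not a Fujitsu recommendation.

```python
# Worked example of the 3:1 SSD-capacity-to-memory rule for a hypothetical node.
node_memory_gb = 384
cores = 24

# 3:1 rule: local SSD capacity should be three times the node's RAM.
ssd_capacity_gb = 3 * node_memory_gb
print(f"Recommended local SSD capacity: {ssd_capacity_gb} GB")  # 1152 GB

# Memory per core, the quantity the per-core ratio above is expressed against.
print(f"Memory per core: {node_memory_gb / cores:.1f} GB")      # 16.0 GB
```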

For more information about our HPDA Reference Model, including benchmark results and configuration recommendations, please check out the related white paper.

Manju Oommen

 

About the Author:

Manju Oommen

Senior Manager – Global Product Marketing
