May 15 2019

How to train Deep Neural Networks faster with HPC systems


As enterprise Artificial Intelligence grows both in terms of frequency and complexity, John Wagner examines the benefits of running deep neural network training on high performance computing architectures.

Thanks to some significant advances in the field of Artificial Intelligence (AI) over the last few years, the technology is rapidly becoming mainstream. The increased availability of parallel computing has contributed to this, in combination with powerful general-purpose graphics processing units (GPUs), multi-core processors and new deep neural networks (DNNs). These factors have enabled developers to tackle problems that were previously considered too complex to solve. And we’ve already seen myriad use cases where AI is transforming industries - from the automation of quality control processes at wind turbine manufacturer Siemens Gamesa to helping clinicians at San Carlos Clinical Hospital in Madrid flag likely health risks for patients. It is also used to model financial markets, automate many support center functions, and deliver an extraordinary level of insight for every step of a shopper’s retail journey.

Many of these advances in AI technology have been due to the lesser-known but still essential companion to AI, the process of machine learning. It’s not very glamorous, feeding pictures of cats to a deep neural network (DNN) so that it will eventually be able to recognize that a picture is not of a cat but of a badger, for example. But the process of teaching DNNs is absolutely essential – and the success of every AI system depends on what it learned in this phase.

DNNs are versatile and can be trained to recognize things such as images, spoken words or diseased cells – simply by feeding them a vast set of information. Advances in human experience and computer technology have enabled DNNs to develop in leaps and bounds over the past five years. Machine learning processes allow DNNs to build up a vast array of attributes, and eventually to identify new sets of objects they have not encountered before, simply by referring to the enormous “trained” network of information, or “intelligence”, if you will.

A common fallacy has been that training a DNN is simple: all you need is a lot of technical resources to automate the process. However, this is a very analog way of teaching a DNN, and there is a much faster way: self-learning.

There is of course an inherent danger here, as the eventual success of a DNN depends entirely on the quality of the input it receives during its learning process. The key to self-learning is being able to manage the massive amounts of data needed for the training process. This is actually the most challenging phase for neural networks, particularly because of the large amount of processing power they require. Once training is under way, it is vital to make sure the process runs through to completion – in some cases, failure to do so can mean you need to start over. Also, the faster the computing technology, the quicker you can teach your AI.
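One common way to reduce the cost of an interrupted training run is to save progress periodically and resume from the last saved state. The following is a minimal sketch of that idea, not a description of any specific product or framework: the function names, the JSON file format and the placeholder "training" update are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Write the training state to a temp file, then rename it into place.

    os.replace is atomic on the same filesystem, so a crash during the
    write never leaves a half-written checkpoint behind.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, weights) from the checkpoint, or a fresh state if none exists."""
    if not os.path.exists(path):
        return 0, [0.0, 0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Run a toy training loop that resumes from the last checkpoint.

    fail_at is a hypothetical hook that simulates a node failure.
    """
    step, weights = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        # Placeholder for a real training step (assumption: any update rule).
        weights = [w + 0.1 for w in weights]
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights)
    save_checkpoint(path, step, weights)
    return step, weights


def demo():
    """Fail part-way through, then resume and finish the run."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        try:
            train(path, total_steps=50, fail_at=25)
        except RuntimeError:
            pass  # the job died; only work since the last checkpoint is lost
        resumed_from, _ = load_checkpoint(path)
        final_step, weights = train(path, total_steps=50)
        return resumed_from, final_step, weights
```

In this sketch the second call to train picks up at the last saved step rather than at zero, so a failure costs at most checkpoint_every steps of work. In a cluster setting the checkpoint file would typically live on shared storage that every node can reach.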

As a result, a cluster of machines using a high-performance computing (HPC) architecture is commonly used for learning, with training processes running in parallel across multiple servers. Using an HPC framework also means that it is easier to scale out deep learning frameworks, enabling them to address even larger-scale, more complex problems.

High performance computing has several characteristics that are particularly well suited to AI frameworks, including:

  1. Multi-node process parallelism
  2. Data access via a high-speed parallel file system with a single uniform namespace
  3. High-speed interconnect to improve inter-node process communication performance
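To make the first and third points more concrete, here is a small single-process simulation of data-parallel training, one widely used way of spreading training across nodes. Each simulated worker computes a gradient on its own shard of the data, and the gradients are then averaged, which is the step a high-speed interconnect would carry in a real cluster. This is only a sketch under stated assumptions: the toy model y = w * x, the function names and the learning rate are hypothetical, and a real deployment would use a deep learning framework with a communication library rather than a Python loop.

```python
def local_gradient(w, shard):
    """Mean-squared-error gradient for the model y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the same average."""
    return sum(values) / len(values)


def train_data_parallel(data, num_workers, steps=200, lr=0.01):
    """Train w with synchronous data-parallel gradient descent.

    The data is split into equal-sized shards, one per simulated worker.
    """
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel on a cluster).
        grads = [local_gradient(w, shard) for shard in shards]
        # One communication step per iteration keeps all workers in sync.
        w -= lr * all_reduce_mean(grads)
    return w


# Toy dataset generated from y = 3 * x, so training should recover w close to 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train_data_parallel(data, num_workers=1)
w_parallel = train_data_parallel(data, num_workers=4)
```

Because the shards are the same size, averaging the per-worker gradients gives the same update as computing the gradient over the full dataset on one machine. The work per step is divided among the workers, at the cost of one communication round per step, which is why interconnect performance matters as the number of nodes grows.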


The benefits of using an HPC-AI Reference Model

An HPC architecture, based on multiple conventional machines linked to collectively deliver advanced processing capability, makes it very easy to scale out an existing environment. As a result, even a small initial installation can easily be expanded as demands for higher processing levels increase.

Utilizing this architecture for AI via an HPC-AI reference model brings several benefits. It creates a proven platform based on optimized HPC technologies that delivers both high performance and an increased ability to scale out processing capabilities in line with requirements. In addition, this approach leads to a solution that can be deployed as a central service, a stand-alone departmental service or even a small office service.

Fujitsu has put together a series of reference architectures and sample configurations based on either the 2U PRIMERGY CX400/CX2570 chassis/server combination or the 2U RX2540 server. Both are specifically designed for Deep Learning environments and support the world’s most advanced NVIDIA visual computing platform, designed to handle processing-heavy data samples such as large quantities of images, voice or video recordings to train the network.


Creating flexible, cost-effective reference architectures for DNN training

Fujitsu has many years of experience both in delivering integrated systems for HPC and in developing and implementing AI systems. This combination of skills means Fujitsu can create flexible, cost-effective reference architectures for DNN training with the best possible price/performance ratios. HPC AI reference models are based on the best architectural components of an HPC infrastructure in combination with the required AI frameworks. We’ve even produced a whitepaper which features multiple sample architectures for different usage scenarios.

Ultimately, workloads will only become more complex and time-critical as enterprises deploy increasingly complex AI systems to improve operational efficiency, make faster, more informed decisions and innovate to create new products and services. Leveraging HPC resources opens up the possibility of parallel processing, extremely fast communication and high-speed storage access. What’s more, all resources can be expanded (scaled out) over time as and when required, instead of needing to be replaced with new infrastructure.

The result is an agile system that effectively delivers AI and HPC workloads, leading to optimized total cost of ownership, improved return on investment and a faster time to result. If you’d like to find out more about deploying AI on HPC architecture, read the full White Paper: PRIMEFLEX for HPC - AI Reference Architecture

Manju Annie Oommen


About the Author:

Manju Annie Oommen

Sr. Manager – Product Marketing

