The New Hardware That Will Power the Next Phase of HPC Growth

It’s been an eventful week for the high performance computing community, as the industry’s leading vendors unveiled the hardware that will power the sector’s next phase of growth. There were highly anticipated releases from HPC stalwarts NVIDIA and Intel, as well as the first major supercomputing project using ARM processors. The biggest surprise was the massive display of computing power from China’s Sunway TaihuLight system, the new champion on the Top500 list of the world’s most powerful supercomputers.

The hardware announced at the International Supercomputing Conference (ISC 16) in Frankfurt lays the groundwork for the next phase of growth in high performance computing. The new gear feeds a growing appetite for horsepower for artificial intelligence initiatives in Silicon Valley, as well as the U.S. supercomputing sector’s push toward “exascale” computing. The announcements highlight the vendors’ different approaches to improving performance, which in turn offer an array of choices to end users.

China Extends its Dominance

After several years of relative stability, the Top500 saw a major disruption with the debut of Sunway TaihuLight, which clocked an amazing 93 petaflops on the LINPACK benchmark. That’s nearly three times as powerful as defending champ Tianhe-2 (33.9 petaflops) and more than five times as powerful as the top U.S. system, Oak Ridge National Laboratory’s Titan, which hit 17.6 petaflops on LINPACK.

The technology powering the Sunway TaihuLight system includes a new ShenWei processor and custom interconnect, both of which were developed in China. The system resides at the National Supercomputing Center in Wuxi, China. The results showcased the advances China has made in HPC technology since April 2015, when the U.S. instituted an embargo on sales to China of high-end processors like Intel’s Xeon Phi. The system uses about 15.5 megawatts of power, which places it among the most efficient supercomputers on a performance-per-watt basis.
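As a rough check on the figures quoted above, the performance ratios and the energy efficiency can be worked out directly. The short Python sketch below uses only the LINPACK and power numbers cited in this article; the function and variable names are illustrative, and the power figure is the approximate one given here rather than an official measurement.

```python
# LINPACK results quoted in this article, in petaflops.
LINPACK_PFLOPS = {
    "Sunway TaihuLight": 93.0,
    "Tianhe-2": 33.9,
    "Titan": 17.6,
}

# Approximate power draw quoted in this article, in megawatts.
TAIHULIGHT_POWER_MW = 15.5


def speedup(system: str, baseline: str) -> float:
    """Return how many times faster `system` is than `baseline` on LINPACK."""
    return LINPACK_PFLOPS[system] / LINPACK_PFLOPS[baseline]


def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert petaflops and megawatts into gigaflops per watt.

    1 petaflops = 1e6 gigaflops and 1 megawatt = 1e6 watts,
    so the two scale factors cancel.
    """
    return (pflops * 1e6) / (megawatts * 1e6)


if __name__ == "__main__":
    top = "Sunway TaihuLight"
    for other in ("Tianhe-2", "Titan"):
        print(f"{top} vs {other}: {speedup(top, other):.2f}x")
    efficiency = gflops_per_watt(LINPACK_PFLOPS[top], TAIHULIGHT_POWER_MW)
    print(f"{top} efficiency: {efficiency:.1f} gigaflops per watt")
```

With these inputs, the new system comes out a little under three times as fast as Tianhe-2, more than five times as fast as Titan, and at roughly 6 gigaflops per watt.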

“It is not just a stunt machine, and these results are impressive and should be taken seriously,” said Jack Dongarra, a leading HPC researcher and co-founder of the Top500. (See The Next Platform for a deep dive into the technical details of the Sunway TaihuLight, including background from Dongarra.)

The new Top500 list marks the first time that the U.S. is not home to the largest number of systems. With a surge in industrial and research installations registered over the last few years, China leads with 167 systems and the U.S. is second with 165. China also leads the performance category, thanks to the No. 1 and No. 2 systems.

Intel Xeon Phi Production Launch

Intel unveiled the next iteration of its Xeon Phi technology for HPC. Xeon Phi was introduced as a coprocessor to accelerate workloads running on Xeon CPUs. At ISC 16, Intel announced the general availability of the Intel Xeon Phi processor (also known by the codename Knights Landing), which had previously been limited to select customers.

“We’ve been shipping in volume for months now, but most of our inventory was already spoken for,” said Charles Wuischpard, the General Manager of Intel’s HPC Platform Group.

Intel’s overview of its Intel Xeon Phi Processor, which has entered general availability. (Source: Intel)

Intel expects to sell 100,000 units of Xeon Phi this year, with adoption by major supercomputing labs including Los Alamos National Laboratory, Argonne, Lawrence Berkeley, Sandia and the Texas Advanced Computing Center (TACC).

While there’s high anticipation for the Xeon Phi processors, Intel is positioning them as part of its Intel Scalable System Framework, which also includes its Omni-Path Fabric interconnect and HPC Orchestrator software.

“We’re looking at the market needs on a system level more than a CPU level,” said Wuischpard, who emphasized the versatility of Xeon Phi to provide significantly improved performance on a range of HPC workloads in finance, life sciences, visualization and energy industry applications.

What remains to be seen is how Xeon Phi will fare in the market for machine learning workloads, where large players are pursuing other approaches. Facebook uses GPUs to power its Big Sur machine learning appliance, while Google has developed a custom ASIC.

“We believe machine learning is an important new workload, but it’s one of many workloads we need to address,” said Wuischpard. “We think Xeon Phi will be a great solution. We believe we’re going to get faster, more reliable results than GPU. We also believe the vast majority of users are seeking a balanced system for many workloads.”

Wuischpard noted that machine learning projects like Google’s are highly customized, and not easily adaptable for other workloads.

“There’s a lot going on in machine learning and AI,” he said. “Some people come at it from a custom ASIC perspective. Some are for very specific use cases. You always have to weigh general purpose use versus specific use. For those who are not able to build custom ASICs, FPGAs would be the next option, and we’re working on that front.”

NVIDIA Debuts New GPU

Intel’s primary rival is NVIDIA, which offers a different vision: accelerating CPU-based HPC workloads with graphics processing units (GPUs). At ISC16, NVIDIA announced its newest GPU hardware, the Tesla P100 accelerator for PCIe servers, which offers major performance and efficiency gains over earlier GPUs.

NVIDIA’s Tesla P100 GPU accelerator for PCIe-enabled servers. (Photo: NVIDIA)

GPU-accelerated computing offloads compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU.

NVIDIA says the P100 enables the creation of “super nodes” that can provide throughput equivalent to 32 commodity CPU-based nodes, with meaningfully lower capital and operational costs.

“Accelerated computing is the only path forward to keep up with researchers’ insatiable demand for HPC and AI supercomputing,” said Ian Buck, vice president of accelerated computing at NVIDIA. “Deploying CPU-only systems to meet this demand would require large numbers of commodity compute nodes, leading to substantially increased costs without proportional performance gains. Dramatically scaling performance with fewer, more powerful Tesla P100-powered nodes puts more dollars into computing instead of vast infrastructure overhead.”

The new accelerators will power a pending upgrade to Europe’s fastest supercomputer, the Piz Daint system at the Swiss National Supercomputing Center in Lugano, Switzerland.

“Tesla P100 accelerators deliver new levels of performance and efficiency to address some of the most important computational challenges of our time,” said Thomas Schulthess, professor of computational physics at ETH Zurich and director of the Swiss National Supercomputing Center. “The upgrade of 4,500 GPU-accelerated nodes on Piz Daint to Tesla P100 GPUs will more than double the system’s performance, enabling researchers to achieve breakthroughs in a range of fields, including cosmology, materials science, seismology and climatology.”

Fujitsu’s ARM Race

ISC16 brought the prospect of another approach to major HPC workloads: the use of low-power processors from ARM, which are best known as the chips powering mobile devices like the iPhone. A small number of companies have been working for years to create ARM chips to power servers, with limited market impact.

Thus, it was a surprise when Fujitsu announced plans to use 64-bit ARM chips to power an upgrade to Japan’s mighty K Supercomputer, currently the number five entry in the Top500. The new system, called Post-K, will be installed at the RIKEN Advanced Institute for Computational Science in Japan in 2020.

A slide from Fujitsu’s ISC16 presentation outlining its plan to use ARM chips in the next-generation K supercomputer. (Source: Fujitsu)

Fujitsu used SPARC64 processors in the initial K machine, but has been offering ARM microprocessors for some time. Fujitsu’s embrace of these chips for its flagship supercomputing project gives ARM a champion in the HPC ecosystem, and raises the prospect that an ARM-powered machine could be a contender to become the first exascale system.