Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer

Keichi Takahashi, Soya Fujimoto, Satoru Nagase, Yoko Isobe, Yoichi Shimomura, Ryusuke Egawa, Hiroyuki Takizawa

TL;DR

The paper addresses the memory wall in HPC by evaluating a prototype SX-Aurora TSUBASA vector supercomputer equipped with the VE30 processor, whose renewed memory subsystem targets memory-intensive workloads. It conducts a performance study using standard benchmarks, microbenchmarks, and real-world workloads (including SPEChpc and the Tohoku kernel collection), and isolates the impact of VE30’s architectural enhancements such as per-core private L3 caches, increased LLC/HBM bandwidth, and the vlfa instruction for indirect vector accumulation. Key findings show VE30 delivering strong performance and efficiency on memory-bound workloads, and competitive single- and multi-node scaling relative to GPUs and CPUs, with substantial gains from the architectural features and from targeted tuning (selective L3 caching and partitioning mode). The work indicates that VE30 can achieve high sustained performance with conventional MPI+OpenMP workflows, which supports its practical adoption for memory-bound HPC workloads and informs software-tuning strategies for future vector architectures.

Abstract

Data movement is a key bottleneck in terms of both performance and energy efficiency in modern HPC systems. The NEC SX-series supercomputers have a long history of accelerating memory-intensive HPC applications by providing sufficient memory bandwidth to applications. In this paper, we analyze the performance of a prototype SX-Aurora TSUBASA supercomputer equipped with the brand-new Vector Engine (VE30) processor. VE30 is the first major update to the Vector Engine processor series, and offers significantly improved memory access performance due to its renewed memory subsystem. Moreover, it introduces new instructions and incorporates architectural advancements tailored for accelerating memory-intensive applications. Using standard benchmarks, we demonstrate that VE30 considerably outperforms other processors in both performance and efficiency of memory-intensive applications. We also evaluate VE30 using applications including SPEChpc, and show that VE30 can run real-world applications with high performance. Finally, we discuss performance tuning techniques to obtain maximum performance from VE30.

Paper Structure (20 sections, 1 equation, 19 figures, 1 table)

This paper contains 20 sections, 1 equation, 19 figures, 1 table.

Figures (19)

  • Figure 1: Block diagram of the VE30 processor.
  • Figure 2: Memory hierarchy of the VE30 processor.
  • Figure 3: HPL benchmark performance.
  • Figure 4: Effective memory bandwidth.
  • Figure 5: HPCG benchmark performance.
  • ...and 14 more figures