Introducing the Arm-membench Throughput Benchmark

Cyrill Burth, Markus Velten, Robert Schöne

TL;DR

This work ports the x86-membench throughput benchmark to Armv8 with NEON and SVE support, enabling fine-grained analysis of memory bandwidth across the L1, L2, and L3 caches and main memory. Arm-membench preserves the original throughput-focused approach while adapting to Arm's load-store architecture and runtime vector-length discovery. Across Fujitsu A64FX, Ampere Altra, and Marvell (formerly Cavium) ThunderX2, the benchmark shows that front-end fetch/decode width and vector width significantly shape achievable throughput, with strong variance across architectures; L1d bandwidth saturates primarily when SIMD data paths are fed with appropriate memory workloads. The results show near-linear scaling with core count for some configurations and underline the practical impact of memory-subsystem design on Arm server performance, which makes the benchmark useful for architecture research and performance modeling. The work also discusses limitations and outlines future extensions to latency analysis and MESI-aware cache-coherence benchmarking.

Abstract

Application performance of modern-day processors is often limited by the memory subsystem rather than actual compute capabilities. Therefore, data throughput specifications play a key role in modeling application performance and determining possible bottlenecks. However, while peak instruction throughputs and bandwidths for local caches are often documented, the achievable throughput can also depend on the relation between memory access and compute instructions. In this paper, we present an Arm version of the well-established x86-membench throughput benchmark, which we have adapted to support all current SIMD extensions of the Armv8 instruction set architecture. We describe aspects of the Armv8 ISA that need to be considered in the portable design of this benchmark. We use the benchmark to analyze the memory subsystem at a fine spatial granularity and to unveil microarchitectural details of three processors: Fujitsu A64FX, Ampere Altra, and Cavium ThunderX2. Based on the resulting performance information, we show that instruction fetch and decoder widths become a potential bottleneck for cache-bandwidth-sensitive workloads due to the load-store concept of the Arm ISA.
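
The front-end argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the paper's model: the function name, its parameters, and all numbers are hypothetical and chosen only for illustration. It assumes that on a load-store ISA every load occupies its own instruction slot, that the front end delivers at most `decode_width` instructions per cycle, and that nothing else (load ports, cache bandwidth) limits the loop.

```python
def frontend_bound_bytes_per_cycle(loads_per_iter: int,
                                   other_instrs_per_iter: int,
                                   vector_bytes: int,
                                   decode_width: int) -> float:
    """Upper bound on load bandwidth imposed by the front end alone.

    Hypothetical helper: assumes each load is a separate instruction
    (load-store ISA) and that decode_width instructions issue per cycle.
    """
    if decode_width <= 0 or loads_per_iter + other_instrs_per_iter <= 0:
        raise ValueError("decode_width and instruction count must be positive")
    instrs_per_iter = loads_per_iter + other_instrs_per_iter
    # Minimum cycles the front end needs to deliver one loop iteration.
    cycles_per_iter = instrs_per_iter / decode_width
    return loads_per_iter * vector_bytes / cycles_per_iter


# Illustrative numbers only: 64-byte vectors (the size of a 512-bit register)
# and a 4-wide front end, with 2 loads plus 2 other instructions per iteration.
print(frontend_bound_bytes_per_cycle(2, 2, 64, 4))
```

Under these assumptions, adding non-load instructions to the loop body lowers the bound even when the cache itself could deliver more, which is the kind of interaction between memory access and compute instructions the benchmark is meant to expose.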

Paper Structure

This paper contains 15 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Relative L1d cache performance of post-increment implementation.
  • Figure 2: A64FX: Throughput of different SVE instructions for all memory levels using a single core and all 48 cores in [GB/s] and ([B/cycle]). The standard deviation is < 1% for all cases.
  • Figure 3: A64FX: Bandwidth of SVE with different numbers of registers loaded per instruction using LD1D, LD2D and LD4D for loading one, two or four registers.
  • Figure 4: A64FX: HBM2 scaling behavior of STREAM TRIAD and the Arm-membench throughput benchmark on one socket, starting with cores in CMG 0. Values from A64FX-ECM and A64FX_HPC_APPs for 48 cores shown as reference. Both use STREAM TRIAD with zero fills.
  • Figure 5: Ampere Altra: Throughput of different NEON instructions for all memory levels using a single core and all 80 cores in [GB/s] and ([B/cycle]). Multicore L3 accesses could not be distinguished due to small L3 size (denoted as '--'). The standard deviation is < 1% for all cases except multicore L1 LOAD (≈ 3%).
  • ...and 1 more figure
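
The captions above report throughput both in GB/s and in B/cycle. A minimal sketch of how the two units relate is given below. The function names are hypothetical, and the sketch assumes 1 GB = 10^9 bytes, a fixed clock frequency shared by all cores, and a per-core B/cycle figure; the captions do not state which conventions the paper uses, so treat these as assumptions.

```python
def bytes_per_cycle(gb_per_s: float, freq_ghz: float, cores: int = 1) -> float:
    """Convert an aggregate bandwidth in GB/s to bytes per cycle per core.

    Assumes 1 GB = 1e9 bytes and that every core runs at freq_ghz.
    """
    if freq_ghz <= 0 or cores <= 0:
        raise ValueError("freq_ghz and cores must be positive")
    # (gb_per_s * 1e9 B/s) / (freq_ghz * 1e9 cycles/s * cores); the 1e9 cancels.
    return gb_per_s / (freq_ghz * cores)


def gigabytes_per_second(b_per_cycle: float, freq_ghz: float, cores: int = 1) -> float:
    """Inverse conversion: per-core bytes per cycle to aggregate GB/s."""
    if freq_ghz <= 0 or cores <= 0:
        raise ValueError("freq_ghz and cores must be positive")
    return b_per_cycle * freq_ghz * cores


# Illustrative numbers only, not values from the paper:
# 100 GB/s on one core at 2.0 GHz corresponds to 50 B/cycle.
print(bytes_per_cycle(100.0, 2.0))
```

Normalizing to B/cycle makes results comparable across processors with different clock frequencies, whereas GB/s is the figure that matters for application-level modeling.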