Performance Optimization of 3D Stencil Computation on ARM Scalable Vector Extension

Hongguang Chen

TL;DR

The paper optimizes a 7-point 3D stencil on ARM SVE by combining Roofline analysis, gem5-based simulation, and CACTI-based area modeling to study how hardware parameters (cache size, SVE vector length) and software techniques (vectorization, tiling, OpenMP) affect throughput, energy, and chip area. It shows that vector length and cache configuration significantly influence performance: code-level optimizations deliver speedups of up to 4.5x, while hardware changes show diminishing returns beyond certain cache sizes. The findings present ARM SVE as a viable path for HPC stencil workloads and give guidance on balancing performance against energy and area, while acknowledging the limitations of a simulation-based evaluation and a single-kernel scope. The work lays groundwork for broader ARM SVE optimizations in HPC, including multi-core and distributed contexts.

Abstract

Stencil computation is essential in high-performance computing, especially for large-scale tasks like liquid simulation and weather forecasting. Optimizing its performance can reduce both energy consumption and computation time, which is critical in disaster prediction. This paper explores optimization techniques for 7-point 3D stencil computation on ARM's Scalable Vector Extension (SVE), using the Roofline model and tools such as gem5 and CACTI. We evaluate software optimizations such as vectorization and tiling, as well as hardware adjustments to ARM SVE vector lengths and cache configurations. The study also examines trade-offs among performance, power consumption, and chip area to identify optimal configurations for ARM-based systems.
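
For readers unfamiliar with the kernel, the following is a minimal C sketch of a 7-point 3D stencil sweep with loop tiling, one of the software optimizations the abstract names. It is not the paper's code: the function name `stencil7_tiled`, the weights `c0` and `c1`, the tile sizes `ty` and `tz`, and the x-fastest memory layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx*ny*nz grid with x as the fastest dimension
   (layout is an assumption, not taken from the paper). */
static inline size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* One sweep of a 7-point stencil over the interior points, tiled in y and z.
   c0 weights the centre point and c1 weights the six face neighbours;
   both weights and the tile sizes ty, tz are hypothetical parameters. */
void stencil7_tiled(const double *restrict in, double *restrict out,
                    size_t nx, size_t ny, size_t nz,
                    double c0, double c1, size_t ty, size_t tz)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3 && ty > 0 && tz > 0);

    for (size_t zz = 1; zz < nz - 1; zz += tz) {
        for (size_t yy = 1; yy < ny - 1; yy += ty) {
            /* Work on one (ty x tz) tile so its planes stay cache-resident. */
            for (size_t z = zz; z < min_sz(zz + tz, nz - 1); ++z) {
                for (size_t y = yy; y < min_sz(yy + ty, ny - 1); ++y) {
                    /* Unit-stride inner loop: the candidate for vectorization. */
                    for (size_t x = 1; x < nx - 1; ++x) {
                        size_t i = idx(x, y, z, nx, ny);
                        out[i] = c0 * in[i]
                               + c1 * (in[i - 1]       + in[i + 1]
                                     + in[i - nx]      + in[i + nx]
                                     + in[i - nx * ny] + in[i + nx * ny]);
                    }
                }
            }
        }
    }
}
```

The x loop is left untiled so that its accesses stay contiguous, which is the shape a compiler can typically auto-vectorize when targeting SVE (for example with GCC or Clang and `-march=armv8-a+sve`). Boundary points are not written; a caller would handle them separately.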

Paper Structure

This paper contains 13 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: 7-point 3D stencil computation
  • Figure 2: Performance on different workloads with fixed cache size
  • Figure 3: Speedup with different code optimizations
  • Figure 4: Assembly Code for Different Optimization Parameters: ‘Auto’ (Left) and Manual ‘SVE’ (Right)
  • Figure 5: Cache Size, Vector Length, and Performance
  • ...and 1 more figure