High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors

Ruimin Shi, Gabin Schieffer, Pei-Hung Lin, Maya Gokhale, Andreas Herten, Ivy Peng

TL;DR

The paper asks whether quantum state-vector simulations can be both fast and portable across vector-length agnostic (VLA) architectures such as ARM SVE and RISC-V RVV, and answers it by porting and optimizing Google's Qsim for ARM SVE. It introduces a VLA design built on VLEN-adaptive memory layout, load buffering, fine-grained loop control, and gate fusion, implemented with SVE intrinsics in a single-source codebase. Evaluation on Grace, Graviton3, and A64FX shows speedups of up to 4.5× on A64FX, 2.5× on Grace, and 1.5× on Graviton3, together with PMU-based insights into vectorization activity and memory bottlenecks. The findings offer guidance for future VLA hardware and compiler design aimed at portable, high-performance quantum simulations.

Abstract

ARM SVE and RISC-V RVV are emerging vector architectures in high-end processors that support vectorization with flexible vector lengths. In this work, we use an important quantum computing workload, quantum state-vector simulation, to understand whether high-performance portability can be achieved with a vector-length agnostic (VLA) design. We propose a VLA design and optimization techniques critical for achieving high performance, including VLEN-adaptive memory layout adjustment, load buffering, fine-grained loop control, and gate fusion-based arithmetic intensity adaptation. We provide an implementation in Google's Qsim and evaluate five quantum circuits of up to 36 qubits on three ARM processors: NVIDIA Grace, AWS Graviton3, and Fujitsu A64FX. By defining new metrics and PMU events to quantify vectorization activities, we draw generic insights for future VLA designs. Our single-source implementation of VLA quantum simulations achieves up to 4.5× speedup on A64FX, 2.5× speedup on Grace, and 1.5× speedup on Graviton3.
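
For readers unfamiliar with the workload, the following is a minimal scalar sketch of what a state-vector simulator does when it applies a single-qubit gate. It is not taken from Qsim and uses no SVE intrinsics. The function names (`apply_single_qubit_gate`, `uniform_superposition`) and the plain-list state representation are assumptions made for illustration only.

```python
import math

# Hadamard gate as a 2x2 matrix (row-major nested lists).
INV_SQRT2 = 1.0 / math.sqrt(2.0)
H = [[INV_SQRT2, INV_SQRT2],
     [INV_SQRT2, -INV_SQRT2]]


def apply_single_qubit_gate(state, gate, k):
    """Apply a 2x2 gate to qubit k of a state vector, in place.

    `state` is a list of 2**n complex amplitudes. Amplitudes whose
    indices differ only in bit k form a pair and are updated together.
    """
    stride = 1 << k
    for base in range(0, len(state), 2 * stride):
        for offset in range(stride):
            i0 = base + offset      # index with bit k = 0
            i1 = i0 + stride        # index with bit k = 1
            a0, a1 = state[i0], state[i1]
            state[i0] = gate[0][0] * a0 + gate[0][1] * a1
            state[i1] = gate[1][0] * a0 + gate[1][1] * a1
    return state


def uniform_superposition(n):
    """Start from |0...0> and apply a Hadamard gate to every qubit."""
    state = [0j] * (1 << n)
    state[0] = 1 + 0j
    for k in range(n):
        apply_single_qubit_gate(state, H, k)
    return state
```

The distance between paired amplitudes is `2**k`, so the memory access pattern depends on the target qubit. A plausible reading of the abstract is that this dependence is why memory layout and loop control matter for vectorization, though the sketch itself makes no performance claim.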

Paper Structure (22 sections, 4 equations, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Vectorizing a loop that contains a conditional statement, using predication to mask out the 2nd and 3rd elements.
  • Figure 2: The performance of five quantum algorithms on three ARM platforms using VLA via compiler auto-vectorization.
  • Figure 3: Pseudocode of the ApplyGate kernel and its disassembly as produced by compiler auto-vectorization.
  • Figure 4: The interleaved memory access pattern in auto-vectorized ApplyGate.
  • Figure 5: The VLA design of applying gates on an $n$-qubit quantum system. In this example, the Hadamard gate acts on q3, with $k=3$ and $numVals=4$.
  • ...and 10 more figures
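
The predication idea named in the Figure 1 caption can be emulated in plain Python without SVE hardware. The sketch below is an illustration under stated assumptions, not the paper's code: `whilelt` is a hypothetical helper that loosely mimics an SVE-style loop predicate, and `vlen` stands in for the number of lanes in the hardware vector. The loop is written once and gives the same result for any `vlen`, which is the vector-length agnostic property.

```python
def whilelt(start, end, vlen):
    """Loop predicate: lane i is active iff start + i < end."""
    return [start + lane < end for lane in range(vlen)]


def scale_positive(data, factor, vlen):
    """Multiply only the positive elements by `factor`.

    The loop processes `vlen` lanes per iteration. Inactive lanes are
    masked out by a predicate instead of being skipped with a branch.
    """
    out = list(data)
    for start in range(0, len(data), vlen):
        loop_pred = whilelt(start, len(data), vlen)
        # Combine the loop predicate with the per-element condition.
        # The `and` short-circuits, so out-of-range lanes are never read.
        cond_pred = [loop_pred[lane] and data[start + lane] > 0
                     for lane in range(vlen)]
        for lane in range(vlen):
            if cond_pred[lane]:  # masked-out lanes keep their old value
                out[start + lane] = data[start + lane] * factor
    return out


# The 2nd and 3rd elements fail the condition and are masked out.
values = [1, -2, -3, 4, 5]
results = [scale_positive(values, 2, vlen) for vlen in (2, 4, 8, 16)]
```

Every entry in `results` is the same list, because the predicate also handles the loop tail when `len(data)` is not a multiple of `vlen`. No separate scalar remainder loop is needed.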