Low-Level and NUMA-Aware Optimization for High-Performance Quantum Simulation

Ali Rezaei, Luc Jaulmes, Maria Bahna, Oliver Thomson Brown, Antonio Barbalace

TL;DR

This work tackles the memory- and bandwidth-bound problem of classical state-vector quantum circuit simulation on a single node by delivering an open-source, NUMA-aware extension to the QuEST simulator. It combines a redesigned AoS data layout, explicit NUMA-aware memory allocation and thread pinning, and AVX-512-based vectorized kernels with aggressive loop unrolling and prefetching to improve locality and throughput. Across primitive gates and circuit workloads (single- and two-qubit gates, RQC, QFT, Grover, and Shor-like circuits), the approach yields substantial speedups on modern multi-core CPUs: up to 5.5–6.5× for single-qubit gates, 4.5× for two-qubit gates, ~4× for RQC, ~1.8× for QFT, and 2.5–4.6× for Grover/Shor. By providing an open, configurable framework that explicitly exposes each optimization, the work enables reproducible performance evaluation and serves as a robust baseline for future distributed and heterogeneous scaling toward noiseless quantum computing simulations.

Abstract

Scalable classical simulation of quantum circuits is crucial for advancing quantum algorithm development and validating emerging hardware. This work focuses on performance enhancements through targeted low-level and NUMA-aware tuning on a single-node system, thereby not only advancing the efficiency of classical quantum simulations but also establishing a foundation for scalable, heterogeneous implementations that bridge toward noiseless quantum computing. Although a few prior studies have reported similar hardware-level optimizations, their implementations have not been released as open-source software, limiting independent validation and further development. We introduce an open-source, high-performance extension to the QuEST state vector simulator that integrates state-of-the-art low-level and NUMA-aware optimizations for modern processors. Our approach emphasizes locality-aware computation and incorporates hardware-specific techniques including NUMA-aware memory allocation, thread pinning, AVX-512 vectorization, aggressive loop unrolling, and explicit memory prefetching. Experiments demonstrate substantial speedups: 5.5-6.5x for single-qubit gate operations, 4.5x for two-qubit gates, 4x for Random Quantum Circuits (RQC), and 1.8x for the Quantum Fourier Transform (QFT). Algorithmic workloads further achieve 4.3-4.6x acceleration for Grover and 2.5x for Shor-like circuits. These results show that systematic, architecture-aware tuning can significantly extend the practical simulation capacity of classical quantum simulators on current hardware.

Paper Structure

This paper contains 19 sections, 10 figures, 3 tables, 9 algorithms.

Figures (10)

  • Figure 1: Schematic representation of two common data layouts for complex quantum amplitudes. In the Structure of Arrays (SoA) approach (top), real and imaginary parts are stored in separate arrays, each of length $2^n$. In the Array of Structures (AoS) approach (bottom), a single array of length $2^n$ holds both real and imaginary parts together in each element.
  • Figure 2: Comparison of three state vector allocation strategies on our dual-node NUMA system.
  • Figure 3: Access pattern for in‐place matrix‐vector multiplication with single‐qubit (unitary) gates on 6 qubits. Each sub‐plot (target 0 to 5) shows how amplitude indices (vertical axis) are reordered in memory during gate application. Colored lines map amplitude blocks to four parallel CPU cores (PU 0–3).
  • Figure 4: Access pattern for in-place matrix-vector multiply with controlled gates on 6 qubits, using qubit 2 as the control. Solid lines represent amplitudes updated (control bit = 1), while dashed lines indicate skipped computations (control bit = 0).
  • Figure 5: Comparison of DRAM access for 50 consecutive Hadamard gates measured with Intel PCM on a dual-NUMA system. (a) Baseline QuEST implementation (default v3.7.0) shows unbalanced memory traffic across sockets; (b) NUMA-aware memory allocation reduces remote accesses by improving locality; and (c) further optimization using a locality-sensitive task scheduler yields near-ideal confinement of DRAM activity to local nodes. Here, the full state vector fits within a single NUMA node up to 34 qubits, while 35 qubits and above require cross-node memory access.
  • ...and 5 more figures
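
To make the layouts of Figure 1 and the access pattern of Figure 3 concrete, the following is a minimal Python sketch, not the paper's implementation. It applies a Hadamard gate to a target qubit under both layouts. The function names (`pair_indices`, `hadamard_soa`, `hadamard_aos`) are hypothetical, and the AoS array is modelled as a flat interleaved list `[re0, im0, re1, im1, ...]`, which stands in for an array of $2^n$ (real, imaginary) structures. No vectorization, prefetching, or NUMA placement is attempted here.

```python
import math


def pair_indices(n_qubits, target):
    """Yield the (i0, i1) amplitude index pairs mixed by a single-qubit gate.

    The two indices differ only in bit `target`, so they sit 2**target apart.
    """
    stride = 1 << target
    for block in range(0, 1 << n_qubits, 2 * stride):
        for offset in range(stride):
            i0 = block + offset
            yield i0, i0 + stride


def hadamard_soa(re, im, n_qubits, target):
    """SoA: real and imaginary parts live in two separate arrays of length 2**n."""
    s = 1.0 / math.sqrt(2.0)
    for i0, i1 in pair_indices(n_qubits, target):
        a_re, a_im, b_re, b_im = re[i0], im[i0], re[i1], im[i1]
        re[i0], im[i0] = s * (a_re + b_re), s * (a_im + b_im)
        re[i1], im[i1] = s * (a_re - b_re), s * (a_im - b_im)


def hadamard_aos(amps, n_qubits, target):
    """AoS: each amplitude's real and imaginary parts are adjacent in memory."""
    s = 1.0 / math.sqrt(2.0)
    for i0, i1 in pair_indices(n_qubits, target):
        a_re, a_im = amps[2 * i0], amps[2 * i0 + 1]
        b_re, b_im = amps[2 * i1], amps[2 * i1 + 1]
        amps[2 * i0], amps[2 * i0 + 1] = s * (a_re + b_re), s * (a_im + b_im)
        amps[2 * i1], amps[2 * i1 + 1] = s * (a_re - b_re), s * (a_im - b_im)
```

In the SoA version each update reads from two separate arrays, whereas in the AoS version the real and imaginary parts of an amplitude are contiguous. In both cases the distance between paired amplitudes grows as $2^{t}$ for target qubit $t$, which is the stride visible in the sub-plots of Figure 3.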