Low-Level and NUMA-Aware Optimization for High-Performance Quantum Simulation
Ali Rezaei, Luc Jaulmes, Maria Bahna, Oliver Thomson Brown, Antonio Barbalace
TL;DR
This work tackles the memory- and bandwidth-bound problem of classical state-vector quantum circuit simulation on a single node by delivering an open-source, NUMA-aware extension to the QuEST simulator. It combines a redesigned AoS data layout, explicit NUMA-aware memory allocation and thread pinning, and AVX-512-based vectorized kernels with aggressive loop unrolling and prefetching to dramatically improve locality and throughput. Across primitive gates and circuit workloads (single- and two-qubit gates, RQC, QFT, Grover, and Shor-like circuits), the approach yields substantial speedups on modern multi-core CPUs: up to $5.5$–$6.5\times$ for single-qubit gates, $4.5\times$ for two-qubit gates, ~4× for RQC, ~1.8× for QFT, and 2.5–4.6× for Grover/Shor. By providing an open, configurable framework that explicitly exposes each optimization, the work enables reproducible performance evaluation and serves as a robust baseline for future distributed and heterogeneous scaling toward noiseless quantum computing simulations.
Abstract
Scalable classical simulation of quantum circuits is crucial for advancing quantum algorithm development and validating emerging hardware. This work focuses on performance enhancements through targeted low-level and NUMA-aware tuning on a single-node system. It thereby advances the efficiency of classical quantum simulations and establishes a foundation for scalable, heterogeneous implementations that bridge toward noiseless quantum computing. Although a few prior studies have reported similar hardware-level optimizations, such implementations have not been released as open-source software, limiting independent validation and further development. We introduce an open-source, high-performance extension to the QuEST state-vector simulator that integrates state-of-the-art low-level and NUMA-aware optimizations for modern processors. Our approach emphasizes locality-aware computation and incorporates hardware-specific techniques including NUMA-aware memory allocation, thread pinning, AVX-512 vectorization, aggressive loop unrolling, and explicit memory prefetching. Experiments demonstrate substantial speedups: 5.5–6.5x for single-qubit gate operations, 4.5x for two-qubit gates, 4x for Random Quantum Circuits (RQC), and 1.8x for the Quantum Fourier Transform (QFT). Algorithmic workloads further achieve 4.3–4.6x acceleration for Grover and 2.5x for Shor-like circuits. These results show that systematic, architecture-aware tuning can significantly extend the practical simulation capacity of classical quantum simulators on current hardware.
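
To make the locality concern concrete, the following is a minimal, unoptimized sketch of the textbook single-qubit gate update on a state vector. It is not the QuEST implementation or the optimized kernel described in this work; the function names `apply_single_qubit_gate` and `hadamard` are hypothetical and chosen only for illustration. It shows that a gate on qubit `target` pairs amplitudes that lie `2**target` entries apart, so the memory access stride grows with the target qubit index.

```python
import math


def apply_single_qubit_gate(state, target, gate):
    """Apply a 2x2 gate to qubit `target` of a state vector, in place.

    state  : list of complex amplitudes, length 2**n
    target : qubit index (0 = least significant bit of the basis index)
    gate   : ((g00, g01), (g10, g11)) matrix entries
    """
    stride = 1 << target  # distance between the two amplitudes of a pair
    (g00, g01), (g10, g11) = gate
    # Walk the vector in blocks of 2*stride entries. In each block, the
    # lower half has the target bit clear and the upper half has it set.
    for block in range(0, len(state), 2 * stride):
        for i in range(block, block + stride):
            a0, a1 = state[i], state[i + stride]
            state[i] = g00 * a0 + g01 * a1
            state[i + stride] = g10 * a0 + g11 * a1
    return state


def hadamard():
    """Return the Hadamard gate as a nested tuple."""
    s = 1 / math.sqrt(2)
    return ((s, s), (s, -s))


if __name__ == "__main__":
    n = 3
    state = [0j] * (1 << n)
    state[0] = 1 + 0j  # start in |000>
    for q in range(n):
        apply_single_qubit_gate(state, q, hadamard())
    # Measurement probabilities of each basis state.
    print([round(abs(a) ** 2, 3) for a in state])
```

Applying a Hadamard to every qubit of the all-zeros state yields a uniform superposition, which gives a quick sanity check. Each pass touches every amplitude once and performs only a few arithmetic operations per entry, which is why such kernels tend to be limited by memory bandwidth and data placement rather than by compute.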
