Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance
W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira
TL;DR
This paper addresses the limited scalability of classical state-vector quantum circuit simulation on HPC systems, where inter-GPU communication can become the bottleneck. It adds MPI support to the QED-C Application-Oriented Benchmarks and benchmarks multi-GPU configurations across diverse interconnects, including the multi-node NVLink (MNNVL) fabric on NVIDIA Grace Blackwell. The study shows that interconnect improvements can yield over 16× faster time to solution for multi-GPU simulations, with MNNVL and NVLink-based paths outperforming PCIe and InfiniBand alternatives, and that data movement dominates scaling at large GPU counts. The results show system architects and researchers why high-bandwidth, low-latency interconnects matter for quantum simulation workloads, and what GPU-aware MPI, fabric memory, and low-level interconnect optimizations offer in practice. The work gives practical guidance for deploying large-scale quantum circuit simulations on HPC platforms and points to hardware and software co-design as the direction for future work in this domain.
Abstract
Classical simulation of quantum algorithms is notoriously resource-intensive, a difficulty intrinsic to the fundamental goal of quantum computing. Nonetheless, simulation is critical to the success of the field and is required for algorithm development and validation, as well as for hardware design. GPU acceleration has become standard practice for simulation, and because of the exponential scaling inherent in classical methods, multi-GPU simulation can be required to reach representative system sizes. In this case, inter-GPU communication can bottleneck performance. In this work, we introduce MPI support into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review advances in interconnect technology and the APIs for multi-GPU communication. We benchmark a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture, the first product to extend high-bandwidth, GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5× across the last few generations of GPUs, advances in interconnect performance have had a larger impact, with over 16× improvement in time to solution for multi-GPU simulations.
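To make the exponential scaling concrete, the following is a minimal sketch of the memory arithmetic behind state-vector simulation. It is an illustration under stated assumptions rather than anything taken from the benchmarks: it assumes double-precision complex amplitudes (16 bytes each), ignores workspace and communication buffers, and restricts GPU counts to powers of two. The function names and the 80 GiB per-GPU capacity are hypothetical.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed to store a full n-qubit state vector (2**n complex amplitudes).

    Assumption: 16 bytes per amplitude (double-precision complex).
    """
    return (2 ** num_qubits) * bytes_per_amplitude


def min_gpus(num_qubits: int, gpu_mem_bytes: int, bytes_per_amplitude: int = 16) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the state vector.

    Assumptions: the state vector is split evenly across GPUs, and workspace
    and communication buffers are ignored, so real requirements are higher.
    """
    total = state_vector_bytes(num_qubits, bytes_per_amplitude)
    gpus = 1
    while gpus * gpu_mem_bytes < total:
        gpus *= 2
    return gpus


if __name__ == "__main__":
    GIB = 2 ** 30
    hypothetical_gpu_mem = 80 * GIB  # illustrative capacity, not a measured value
    for n in (30, 32, 34, 36):
        size_gib = state_vector_bytes(n) / GIB
        print(f"{n} qubits: {size_gib:.0f} GiB, "
              f"at least {min_gpus(n, hypothetical_gpu_mem)} GPU(s)")
```

Each additional qubit doubles the memory footprint, so under these assumptions a few extra qubits are enough to push a simulation from one GPU to several, at which point inter-GPU data movement enters the critical path.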
