Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira

TL;DR

This paper addresses the scalability bottlenecks of classical state-vector quantum circuit simulation on HPC systems. It adds MPI support to the QED-C Application-Oriented Benchmarks and benchmarks multi-GPU configurations across a range of interconnects, including the Multi-Node NVLink (MNNVL) fabric on NVIDIA Grace Blackwell. The study shows that interconnect improvements can yield over 16× faster time-to-solution for multi-GPU simulations, with MNNVL and NVLink-based paths outperforming PCIe and InfiniBand alternatives, and that data movement dominates scaling at large GPU counts. For system architects and researchers, the results underline the importance of high-bandwidth, low-latency interconnects and the practical benefits of GPU-aware MPI, fabric memory, and low-level interconnect optimizations for quantum simulation workloads. The work also offers guidance for deploying large-scale quantum circuit simulations on HPC platforms and points toward future hardware and software co-design in this domain.

Abstract

As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and is required for algorithm development and validation, as well as hardware design. GPU acceleration has become standard practice for simulation, and because of the exponential scaling inherent in classical methods, multi-GPU simulation can be required to reach representative system sizes. In this case, inter-GPU communication can bottleneck performance. In this work, we introduce MPI into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review the advances in interconnect technology and the APIs for multi-GPU communication. We benchmark a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture, which represents the first product to extend high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements to GPU architecture have led to speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact, with over 16X improvement in time to solution for multi-GPU simulations.
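The exponential scaling mentioned in the abstract can be made concrete with a small calculation. The sketch below estimates the memory needed for a full state vector of n qubits, which holds 2^n complex amplitudes. The function name `state_vector_bytes` is hypothetical, and the default of 8 bytes per amplitude (single-precision complex) is an assumption for illustration; the text above does not state which precision the benchmarks use.

```python
# Rough memory estimate for a full state-vector simulation.
# Assumption (not from the source): 8 bytes per amplitude, i.e. complex64.
# Pass bytes_per_amplitude=16 for double-precision complex amplitudes.

GIB = 1 << 30  # bytes in one gibibyte


def state_vector_bytes(n_qubits: int, bytes_per_amplitude: int = 8) -> int:
    """Return the bytes needed to store all 2**n_qubits amplitudes."""
    if n_qubits < 0:
        raise ValueError("n_qubits must be non-negative")
    return (1 << n_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    for n in (33, 36, 39):
        print(f"{n} qubits: {state_vector_bytes(n) // GIB} GiB")
```

Each additional qubit doubles the footprint, which is why a few extra qubits can push a simulation from one GPU to many and make inter-GPU communication part of the critical path.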

Paper Structure

This paper contains 18 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Single-GPU generational speedups for CUDA-Q simulation of the 33-qubit QPE and HamLib circuits on GPUs from the Ampere, Hopper, and Genesis systems. Absolute Ampere measurements were 5.2 s and 72.6 s for the respective benchmarks.
  • Figure 2: Weak-scaling performance for the QPE benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL, with the former also disabling RDMA from the NIC. The number of qubits ranges from 33 on a single GPU to 39 on 64 GPUs.
  • Figure 3: Strong-scaling performance for the 33-qubit QPE benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL, with the former also disabling RDMA from the NIC.
  • Figure 4: Strong-scaling performance for the 33-qubit HamLib benchmark on various systems. Genesis-MPI uses CUDA-aware MPI algorithms for MNNVL, whereas Genesis-CUDA uses the low-level VMM API. Genesis-IB and Genesis-IB-RDMA disable MNNVL, with the former also disabling RDMA from the NIC.
  • Figure 5: Speedup in circuit simulation time with 64-GPU Perlmutter performance as the baseline.
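
The weak-scaling setup described in the Figure 2 caption (33 qubits on one GPU up to 39 qubits on 64 GPUs) matches adding one qubit each time the GPU count doubles, so every GPU keeps the same number of amplitudes. The sketch below checks that arithmetic and illustrates why gates on the added qubits need inter-GPU traffic. It uses a generic textbook layout in which the low-order qubit bits index amplitudes inside a GPU and the high-order bits select the rank; this layout and the names `weak_scaling_qubits` and `exchange_partner` are assumptions for illustration, not a description of how CUDA-Q or the benchmarks partition the state.

```python
# Generic distributed state-vector bookkeeping (illustrative assumption,
# not the partitioning scheme of any specific simulator).

from typing import Optional


def weak_scaling_qubits(n_gpus: int, base_qubits: int = 33) -> int:
    """Qubit count that keeps 2**base_qubits amplitudes on every GPU."""
    if n_gpus < 1 or n_gpus & (n_gpus - 1):
        raise ValueError("n_gpus must be a power of two")
    # log2 of a power of two is its bit length minus one.
    return base_qubits + n_gpus.bit_length() - 1


def exchange_partner(rank: int, qubit: int, local_qubits: int) -> Optional[int]:
    """Rank holding the paired amplitudes for a single-qubit gate.

    Returns None when the qubit is local and no communication is needed.
    """
    if qubit < local_qubits:
        return None
    # A global qubit maps to one bit of the rank index; flipping that bit
    # gives the rank that owns the other half of each amplitude pair.
    return rank ^ (1 << (qubit - local_qubits))


if __name__ == "__main__":
    for gpus in (1, 2, 4, 8, 16, 32, 64):
        print(gpus, "GPUs ->", weak_scaling_qubits(gpus), "qubits")
```

Under this layout, a 39-qubit run on 64 GPUs has 6 global qubits, and any gate touching one of them involves a pairwise exchange between ranks, which is the kind of traffic whose cost depends on the interconnect path.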