Efficient techniques to GPU Accelerations of Multi-Shot Quantum Computing Simulations
Jun Doi, Hiroshi Horii, Christopher Wood
TL;DR
This work tackles the high cost of simulating noisy multi-shot quantum circuits on classical hardware by focusing on GPU-accelerated execution. It introduces two orthogonal techniques to improve performance across the qubit spectrum: batch-shots, which batches multiple shots into a single GPU kernel to reduce host overheads, and shot-branching, which shares a common state across shots and branches only when randomness occurs. Implemented in Qiskit Aer and evaluated on a GPU-rich IBM Power AC922 cluster, the methods achieve speedups of up to roughly 10x to 100x depending on noise model and circuit size, with scalable results across multiple GPUs and nodes. The study also clarifies when each technique is advantageous and points to future work combining the approaches and extending them to cuQuantum APIs and CPU implementations.
Abstract
Quantum computers are becoming practical for a growing number of applications. However, simulating quantum computing on classical computers remains both demanding and useful, because current quantum computers are limited by computing resources, hardware constraints, instability, and noise. Improving the performance of quantum computing simulation on classical computers will contribute to the development of quantum computers and their algorithms. Such simulations require long execution times, especially for quantum circuits with a large number of qubits, or when simulating a large number of shots for noise simulations or for circuits with intermediate measurements. Graphics processing units (GPUs) are well suited to accelerating quantum computing simulations because of their computational power and high-bandwidth memory, and they have a large advantage when simulating circuits with relatively large numbers of qubits. However, GPUs are inefficient at simulating multi-shot runs with noise because the randomness prevents a high degree of parallelization. In addition, GPUs are at a disadvantage when simulating circuits with a small number of qubits because of the large overheads of GPU kernel execution. In this paper, we introduce optimization techniques for multi-shot simulations on GPUs. We gather multiple shots into a single GPU kernel execution to reduce overheads, scheduling the randomness caused by noise. In addition, we introduce shot-branching, which reduces the computation and memory usage of multi-shot simulations. Using these techniques, we achieve a 10x speedup over previous implementations.
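To make the shot-branching idea concrete, the following is a minimal sketch in plain Python, not the Qiskit Aer implementation. It assumes a single-qubit state stored as a pair of amplitudes and treats measurement as the only source of randomness (noise channels are not modeled). All names here (`apply_gate`, `branch_on_measure`, `run_shot_branching`) are hypothetical and chosen for illustration. All shots start on one shared state; a branch splits only at a random event, and each resulting branch carries the number of shots that took that outcome.

```python
import math
import random
from collections import Counter

# Hadamard gate as a 2x2 matrix of real amplitudes.
H = ((1 / math.sqrt(2), 1 / math.sqrt(2)),
     (1 / math.sqrt(2), -1 / math.sqrt(2)))


def apply_gate(state, gate):
    """Apply a 2x2 gate to a single-qubit state (a0, a1)."""
    a0, a1 = state
    return (gate[0][0] * a0 + gate[0][1] * a1,
            gate[1][0] * a0 + gate[1][1] * a1)


def branch_on_measure(branch, rng):
    """Split one branch into at most two, one per measurement outcome."""
    state, shots, record = branch
    p1 = abs(state[1]) ** 2
    # Each shot draws its own outcome; only the count per outcome is kept.
    n1 = sum(rng.random() < p1 for _ in range(shots))
    out = []
    for outcome, n in ((0, shots - n1), (1, n1)):
        if n > 0:
            # For one qubit the collapsed state is a basis state
            # (the global phase is dropped because it is unobservable).
            collapsed = (1.0, 0.0) if outcome == 0 else (0.0, 1.0)
            out.append((collapsed, n, record + str(outcome)))
    return out


def run_shot_branching(circuit, shots, seed=0):
    """Run `circuit` (a list of gates or the string "measure") for `shots` shots."""
    rng = random.Random(seed)
    # One branch initially holds every shot: (state, shot count, outcomes so far).
    branches = [((1.0, 0.0), shots, "")]
    gate_applications = 0
    for op in circuit:
        if op == "measure":
            branches = [b for br in branches for b in branch_on_measure(br, rng)]
        else:
            # A gate is applied once per branch, not once per shot.
            branches = [(apply_gate(s, op), n, r) for s, n, r in branches]
            gate_applications += len(branches)
    counts = Counter()
    for _, n, record in branches:
        counts[record] += n
    return counts, gate_applications


if __name__ == "__main__":
    counts, apps = run_shot_branching([H, "measure", H, "measure"], shots=1000)
    print(dict(counts), "gate applications:", apps)
```

In this toy circuit a per-shot simulation would apply 2 gates for each of the 1000 shots, whereas the branching version applies at most 3 gate updates in total: one on the shared state, then one per branch after the first measurement. The saving in a real simulator depends on how often random events occur and how many distinct branches they create.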
