Efficient techniques to GPU Accelerations of Multi-Shot Quantum Computing Simulations

Jun Doi; Hiroshi Horii; Christopher Wood

TL;DR

This work tackles the high cost of simulating noisy multi-shot quantum circuits on classical hardware by focusing on GPU-accelerated execution. It introduces two orthogonal techniques that improve performance for circuits ranging from small to large qubit counts: batch-shots, which batches multiple shots into a single GPU kernel to reduce host overheads, and shot-branching, which shares a common state across shots and branches only when randomness occurs. Implemented in Qiskit Aer and evaluated on an IBM Power AC922 cluster with multiple GPUs per node, the methods achieve speedups of roughly 10x to 100x depending on the noise model and circuit size, and the results scale across multiple GPUs and nodes. The study also clarifies when each technique is advantageous and points to future work on combining the approaches and extending them to cuQuantum APIs and CPU implementations.
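The shot-branching idea summarized above can be illustrated with a minimal CPU-only sketch. It is not the Qiskit Aer implementation: the single-qubit state, the helper names (`apply_h`, `measure_branch`, `run_shot_branching`), and the toy circuit (H, measure, H, measure) are all assumptions chosen for illustration. All shots share one state until a measurement introduces randomness, at which point the shots are split across outcomes and each branch is simulated once.

```python
import math
import random


def apply_h(state):
    """Apply a Hadamard gate to a single-qubit state (amp0, amp1)."""
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))


def measure_branch(state, shots, rng):
    """Split `shots` shots across the two measurement outcomes.

    Returns (collapsed_state, outcome, n_shots) for each outcome that
    received at least one shot. Global phase is dropped on collapse,
    which is fine for this toy example.
    """
    p0 = abs(state[0]) ** 2
    n0 = sum(1 for _ in range(shots) if rng.random() < p0)
    branches = []
    if n0 > 0:
        branches.append(((1.0, 0.0), 0, n0))
    if shots - n0 > 0:
        branches.append(((0.0, 1.0), 1, shots - n0))
    return branches


def run_shot_branching(shots, seed=0):
    """Toy circuit: H, measure, H, measure (hypothetical example)."""
    rng = random.Random(seed)
    gate_applications = 0

    # One shared state serves every shot until randomness occurs.
    state = apply_h((1.0, 0.0))
    gate_applications += 1

    counts = {}
    # The first measurement is the first random event: branch here.
    for collapsed, first_bit, n in measure_branch(state, shots, rng):
        # Each branch applies the next gate once for all of its shots.
        branch_state = apply_h(collapsed)
        gate_applications += 1
        for _, second_bit, m in measure_branch(branch_state, n, rng):
            key = f"{first_bit}{second_bit}"
            counts[key] = counts.get(key, 0) + m
    return counts, gate_applications


if __name__ == "__main__":
    counts, gates = run_shot_branching(4000)
    print(counts, "gate applications:", gates)
```

In this sketch the number of gate applications is bounded by the number of branches (at most three here), whereas simulating each of the 4000 shots independently would apply two gates per shot.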

Abstract

Quantum computers are becoming practical for numerous applications. However, simulating quantum computing on classical computers remains both demanding and useful, because current quantum computers are limited by available computing resources, hardware constraints, instability, and noise. Improving the performance of quantum computing simulation on classical computers will contribute to the development of quantum computers and their algorithms. Such simulations require long execution times, especially for quantum circuits with a large number of qubits, or when simulating a large number of shots for noise simulations or circuits with intermediate measurements. Graphics processing units (GPUs) are well suited to accelerating quantum computing simulations because of their computational power and high-bandwidth memory, and they have a large advantage when simulating circuits with relatively many qubits. However, GPUs are inefficient at simulating multi-shot runs with noise because the randomness prevents a high degree of parallelization. In addition, GPUs are at a disadvantage when simulating circuits with a small number of qubits because of the large overheads of GPU kernel execution. In this paper, we introduce optimization techniques for multi-shot simulations on GPUs. We gather multiple shots of a simulation into a single GPU kernel execution to reduce overheads by scheduling the randomness caused by noise. In addition, we introduce shot-branching, which reduces calculations and memory usage for multi-shot simulations. Using these techniques, we achieve a 10x speedup over previous implementations.
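The batching idea in the abstract can be sketched on the CPU as follows. This is an illustrative stand-in rather than the paper's GPU kernel: `sample_pauli_labels`, `batched_apply`, and `run_batched` are hypothetical names, each shot holds a single-qubit state, and the "kernel launch" is simply one Python call that updates every shot. Shots for which no noise is sampled receive an identity gate, so that every shot executes the same batched step. The shot count and error rate in the usage lines are the values quoted in the Figure 2 caption.

```python
import random

# 2x2 Pauli matrices; "I" pads shots where no noise was sampled.
PAULIS = {
    "I": ((1, 0), (0, 1)),
    "X": ((0, 1), (1, 0)),
    "Y": ((0, -1j), (1j, 0)),
    "Z": ((1, 0), (0, -1)),
}


def sample_pauli_labels(shots, error_rate, rng):
    """Sample one Pauli label per shot; most shots get the identity."""
    labels = []
    for _ in range(shots):
        if rng.random() < error_rate:
            labels.append(rng.choice("XYZ"))
        else:
            labels.append("I")
    return labels


def batched_apply(states, labels):
    """Stand-in for ONE batched kernel launch over all shots.

    On a GPU, each shot would be handled by its own group of threads
    and would read its own gate parameters from a per-shot buffer.
    """
    out = []
    for (a, b), label in zip(states, labels):
        m = PAULIS[label]
        out.append((m[0][0] * a + m[0][1] * b,
                    m[1][0] * a + m[1][1] * b))
    return out


def run_batched(shots, error_rate, seed=0):
    """Apply one sampled noise layer to all shots with a single launch."""
    rng = random.Random(seed)
    states = [(1.0, 0.0)] * shots      # every shot starts in |0>
    labels = sample_pauli_labels(shots, error_rate, rng)
    states = batched_apply(states, labels)
    launches = 1                       # versus `shots` launches unbatched
    return states, labels, launches


if __name__ == "__main__":
    states, labels, launches = run_batched(4000, 0.01)
    noisy = sum(1 for label in labels if label != "I")
    print("launches:", launches, "shots with noise:", noisy)
```

The design point is that the per-launch host overhead is paid once per noise layer instead of once per shot, at the cost of applying identity gates to shots that did not need any operation.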

Paper Structure (29 sections, 1 equation, 18 figures, 2 tables)

Introduction
Related Work
Quantum Computing Simulation on Qiskit Aer
Qiskit Aer Overview
Multi-shots and Noise Simulations in Qiskit Aer
Pauli Noise Model
Kraus Noise Model
GPU Acceleration and Issues in Qiskit Aer
Overview of GPU Support in Qiskit Aer
Performance Issues in GPU Acceleration
Acceleration of Multi-shot Simulations
Implementing Batch-Shots Technique
Scheduling and Batching Multi-shot Simulation
Data Structure for Batched Multi-Shot Kernel Execution
...and 14 more sections

Figures (18)

Figure 1: Comparison of the GPU overheads incurred on the host CPU in multi-shot simulation. a) With a small number of qubits, the overheads are large relative to the computation, whereas b) with a large number of qubits, the overheads are negligible.
Figure 2: Simulation time comparison of a QFT circuit with Pauli noise (4000 shots, error rate = 0.01) on Qiskit Aer running on an IBM Power System AC922 with a single NVIDIA Tesla V100 (16GB).
Figure 3: Batched execution of multiple shots in a single kernel. a) Calling the kernel once per shot incurs a large GPU overhead on the host CPU for every shot, whereas b) batched kernel execution incurs the overhead only once.
Figure 4: Data structure for multi-shot simulation on a GPU. There are three vectors of data: classical bit register to store conditions, qubit register to store probability amplitudes, and parameter buffers to store data used for gate operations for each shot.
Figure 5: Runtime sampling of Pauli noise for each shot. For shots in which no noise is sampled, ID gates are added so that Pauli noise can be applied to all shots in a batched kernel.
...and 13 more figures
