Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version)

Mingkuan Xu; Shiyi Cao; Xupeng Miao; Umut A. Acar; Zhihao Jia

TL;DR

This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation, and develops Atlas, a distributed, multi-GPU quantum circuit simulator that outperforms state-of-the-art GPU-based simulators by more than $2\times$ on average.

Abstract

This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize communication costs, we formulate an Integer Linear Program that rewards simulation of "nearby" gates on "nearby" GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2$\times$ on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.
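To make the role of "local" qubits concrete, here is a minimal sketch of a Schrödinger-style single-qubit gate update on a flat state vector. It is not Atlas's implementation: the function names `apply_single_qubit_gate` and `shards_touched_by_pair` are hypothetical, and the layout (qubit $k$ corresponds to bit $k$ of the amplitude index, with each shard holding $2^{L}$ consecutive amplitudes) is an assumption made for illustration.

```python
def apply_single_qubit_gate(state, gate, k):
    """Apply a 2x2 gate to qubit k of a state vector stored as a flat list.

    Assumption: qubit k corresponds to bit k of the amplitude index.
    """
    stride = 1 << k
    out = list(state)
    for base in range(len(state)):
        if base & stride:
            continue  # visit each amplitude pair once, from its bit-k = 0 member
        a0, a1 = state[base], state[base | stride]
        out[base] = gate[0][0] * a0 + gate[0][1] * a1
        out[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return out


def shards_touched_by_pair(index, k, num_local):
    """Return the set of shard ids holding an amplitude and its partner for qubit k.

    Assumption: each shard stores 2**num_local consecutive amplitudes.
    """
    partner = index ^ (1 << k)
    return {index >> num_local, partner >> num_local}


X = [[0, 1], [1, 0]]
state = [1, 0, 0, 0]  # two qubits, all amplitude on index 0
flipped_q0 = apply_single_qubit_gate(state, X, 0)
flipped_q1 = apply_single_qubit_gate(state, X, 1)
```

With one local qubit, `shards_touched_by_pair(0, 0, 1)` contains a single shard, while `shards_touched_by_pair(0, 1, 1)` contains two: a gate on a non-local qubit pairs amplitudes that live in different shards, which is why staging tries to make every gate's qubits local.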

Paper Structure (40 sections, 6 theorems, 11 equations, 37 figures, 2 tables, 5 algorithms)

Introduction
Background
Hierarchical Partitioning for Simulation
Circuit Staging
Circuit staging problem
ILP-based circuit staging
Circuit Kernelization
Implementation
Reducing Size of DP State in Kernelize
Cost Function in Kernelize
Atlas
Evaluation
Experimental Setup
Benchmarks
Preprocessing circuits
...and 25 more sections

Key Result

Theorem 1

The ILP-based staging algorithm (Stage) returns the minimum feasible number of stages.

Figures (37)

Figure 1: An example application of circuit partitioning and execution. We stage the circuit so that qubits of each gate map to local qubits (i.e., green lines in each stage). The notation $q_i[p_j]$ indicates that the $i$-th logical qubit maps to the $j$-th physical qubit. The Kernelize algorithm then partitions the gates of each stage into kernels that provide for data parallelism.
Figure 2: Sharding with different types of communication. The simulation has 1 local, 1 regional, and 1 global qubit. $p_i - q_j$ indicates that the $i$-th physical qubit is mapped to the $j$-th logical qubit. Inter-node communication is triggered if we update any global qubits (inter-node communication panel), and only intra-node communication is triggered otherwise (inter-device communication panel).
Figure 3: Kernel examples. The two green dashed kernels satisfy the DP constraint. The two blue dotted kernels do not satisfy the DP constraint and thus are not considered by the Kernelize algorithm.
Figure 4: An example DP state in the implementation. The circuit sequence is $\mathcal{C} = [g_0, g_1, \dots, g_9]$.
Figure 5: Weak scaling of Atlas, HyQuas, cuQuantum, and Qiskit with 28 local qubits as the number of global qubits increases from 0 (on 1 GPU) to 8 (on 256 GPUs). Qiskit is slow and usually does not fit into our charts.
...and 32 more figures
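The sharding described in the Figure 2 caption can be sketched as a bit-field split of the amplitude index. This is an illustrative sketch only: the function names are hypothetical, and the bit ordering (local qubits in the lowest bits, regional qubits next, global qubits in the highest bits) is an assumption, not something the captions above state.

```python
def locate_amplitude(index, num_local, num_regional, num_global):
    """Split an amplitude index into (node, gpu, offset).

    Assumed layout: low bits = local qubits (offset inside one GPU shard),
    middle bits = regional qubits (GPU within a node),
    high bits = global qubits (node).
    """
    offset = index & ((1 << num_local) - 1)
    gpu = (index >> num_local) & ((1 << num_regional) - 1)
    node = (index >> (num_local + num_regional)) & ((1 << num_global) - 1)
    return node, gpu, offset


def communication_to_localize(physical_qubit, num_local, num_regional):
    """Classify the communication needed to remap a physical qubit to a local one."""
    if physical_qubit < num_local:
        return "none"  # already local
    if physical_qubit < num_local + num_regional:
        return "intra-node"  # regional: data moves between GPUs of one node
    return "inter-node"  # global: data moves between nodes


# Figure 2's setting: 1 local, 1 regional, and 1 global qubit.
where = locate_amplitude(0b101, 1, 1, 1)
kinds = [communication_to_localize(q, 1, 1) for q in range(3)]
# Figure 5's largest setting: 8 non-local qubits give 2**8 shards.
num_shards = 1 << 8
```

Here `num_shards` is 256, matching the 256 GPUs reported for 8 global qubits in Figure 5.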

Theorems & Definitions (15)

Definition 1: Local, regional, and global qubits
Definition 2: Insular Qubit
Theorem 1: Optimality of Stage
proof
Theorem 2: Correctness of the Kernelize algorithm
proof
Theorem 3: The DP constraint allows all contiguous kernels
proof
Definition 3: Extensible qubit for a kernel
Theorem 4
...and 5 more
