Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version)

Mingkuan Xu, Shiyi Cao, Xupeng Miao, Umut A. Acar, Zhihao Jia

TL;DR

This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation, and develops Atlas, a distributed, multi-GPU quantum circuit simulator that outperforms state-of-the-art GPU-based simulators by more than $2\times$ on average.

Abstract

This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize communication costs, we formulate an Integer Linear Program that rewards simulation of "nearby" gates on "nearby" GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2$\times$ on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.
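
As a rough illustration of what "Schrödinger-style" simulation involves, the sketch below stores all $2^n$ amplitudes of an $n$-qubit state and applies a single-qubit gate by updating pairs of amplitudes whose indices differ only in the target bit. It is a minimal single-process sketch, not Atlas's implementation; the function name `apply_single_qubit_gate` and the convention that qubit `k` corresponds to bit `k` of the amplitude index are assumptions made for this example.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to one qubit of a full state vector.

    state  : list of 2**n complex amplitudes (hypothetical layout: qubit k
             corresponds to bit k of the amplitude index).
    gate   : 2x2 nested list [[g00, g01], [g10, g11]].
    target : index of the qubit the gate acts on.
    """
    stride = 1 << target
    new_state = list(state)
    for i in range(len(state)):
        # Visit each pair once, starting from the index whose target bit is 0.
        if i & stride == 0:
            j = i | stride  # partner index: same bits, target bit set to 1
            a0, a1 = state[i], state[j]
            new_state[i] = gate[0][0] * a0 + gate[0][1] * a1
            new_state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


# Hadamard gate applied to qubit 0 of the two-qubit state |00>.
h = 1 / math.sqrt(2)
HADAMARD = [[h, h], [h, -h]]
result = apply_single_qubit_gate([1, 0, 0, 0], HADAMARD, 0)
```

Each pair `(i, j)` differs only in the target bit, so when the state vector is split into contiguous shards, a gate on a low-order qubit touches amplitudes that live in the same shard. This is one way to see why mapping a gate's qubits to local qubits avoids communication.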

Paper Structure

This paper contains 40 sections, 6 theorems, 11 equations, 37 figures, 2 tables, 5 algorithms.

Key Result

Theorem 1

The Stage algorithm (alg:ilp) returns the minimum feasible number of stages.

Figures (37)

  • Figure 1: An example application of circuit partitioning and execution. We stage the circuit so that qubits of each gate map to local qubits (i.e., green lines in each stage). The notation $q_i[p_j]$ indicates that the $i$-th logical qubit maps to the $j$-th physical qubit. The Kernelize algorithm then partitions the gates of each stage into kernels that provide for data parallelism.
  • Figure 2: Sharding with different types of communication. The simulation has 1 local, 1 regional, and 1 global qubit. $p_i - q_j$ indicates that the $i$-th physical qubit is mapped to the $j$-th logical qubit. Inter-node communication is triggered if we update any global qubits (\ref{fig:inter_node_communications}), and only intra-node communication is triggered otherwise (\ref{fig:inter_device_communications}).
  • Figure 3: Kernel examples. The two green dashed kernels satisfy \ref{dp-constraint}. The two blue dotted kernels do not satisfy \ref{dp-constraint} and thus are not considered by the Kernelize algorithm.
  • Figure 4: An example DP state in the implementation. The circuit sequence is $\mathcal{C} = [g_0, g_1, \dots, g_9]$.
  • Figure 5: Weak scaling of Atlas, HyQuas, cuQuantum, and Qiskit with 28 local qubits as the number of global qubits increases from 0 (on 1 GPU) to 8 (on 256 GPUs). Qiskit is slow and usually does not fit into our charts.
  • ...and 32 more figures
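
The captions of Figures 2 and 5 can be read as a small rule: physical qubits are split into local, regional, and global groups, and the kind of qubit being updated determines the communication required. The sketch below encodes that reading. The index layout (local qubits first, then regional, then global), the function names, and the "none" result for updates that touch only local qubits are assumptions made for this example and are not taken from the paper.

```python
def classify_qubit(physical_index, num_local, num_regional):
    """Classify a physical qubit, assuming (hypothetically) that the lowest
    indices are local, the next ones regional, and the rest global."""
    if physical_index < num_local:
        return "local"
    if physical_index < num_local + num_regional:
        return "regional"
    return "global"


def communication_for_update(updated_qubits, num_local, num_regional):
    """Communication implied by updating the given physical qubits.

    Follows the Figure 2 caption: updating any global qubit triggers
    inter-node communication; otherwise only intra-node communication is
    needed. Returning "none" when only local qubits are touched is an
    assumption of this sketch.
    """
    kinds = {classify_qubit(p, num_local, num_regional) for p in updated_qubits}
    if "global" in kinds:
        return "inter-node"
    if "regional" in kinds:
        return "intra-node"
    return "none"


def num_shards(num_regional, num_global):
    """Number of state-vector shards: one per assignment of the non-local bits."""
    return 2 ** (num_regional + num_global)


# The Figure 2 setting: 1 local, 1 regional, and 1 global qubit.
kinds = [classify_qubit(p, num_local=1, num_regional=1) for p in range(3)]
```

Under this reading, the weak-scaling setup in Figure 5 follows from the same arithmetic: 8 non-local qubits give `num_shards(0, 8) == 256` shards, one per GPU, each holding the amplitudes addressed by the 28 local qubits.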

Theorems & Definitions (15)

  • Definition 1: Local, regional, and global qubits
  • Definition 2: Insular Qubit
  • Theorem 1: Optimality of Stage
  • proof
  • Theorem 2: Correctness of the Kernelize algorithm
  • proof
  • Theorem 3: \ref{dp-constraint} allows all contiguous kernels
  • proof
  • Definition 3: Extensible qubit for a kernel
  • Theorem 4
  • ...and 5 more