Lazy Qubit Reordering for Accelerating Parallel State-Vector-based Quantum Circuit Simulation

Yusuke Teranishi, Shoma Hiraoka, Wataru Mizukami, Masao Okita, Fumihiko Ino

TL;DR

The out-of-order approach eliminates redundant reorderings by introducing intentional delays in reordering communications such that multiple reorderings can be aggregated into a single reordering.

Abstract

This paper proposes two quantum operation scheduling methods for accelerating parallel state-vector-based quantum circuit simulation using multiple graphics processing units (GPUs). The proposed methods reduce all-to-all communication caused by qubit reordering (QR), which can dominate the overhead of parallel simulation. Our approach eliminates redundant QRs by introducing intentional delays in QR communications such that multiple QRs can be aggregated into a single QR. The delays are carefully introduced based on the principles of time-space tiling, a cache optimization technique for classical computers, which we use to arrange the execution order of quantum operations. Moreover, we present an extended scheduling method for the hierarchical interconnection of GPU cluster systems to avoid slow inter-node communication. We develop these methods tailored for two primary procedures in variational quantum eigensolver (VQE) simulation: quantum state update (QSU) and expectation value computation (EVC). Experimental validation on 32-GPU executions demonstrates acceleration in QSU and EVC -- up to 54$\times$ and 606$\times$, respectively -- compared to existing methods. Moreover, our extended scheduling method further reduces communication time by up to 15\% in a two-layered interconnected cluster system. Our approach is useful for any quantum circuit simulation that includes QSU and/or EVC.

Paper Structure

This paper contains 28 sections, 30 equations, 32 figures, 1 table, 6 algorithms.

Figures (32)

  • Figure 1: Example of a quantum circuit for a QSU and its QCT form. The gate dependencies $U_0 \succ U_2$, $U_1 \succ U_2$, $U_2 \succ U_3$, and $U_2 \succ U_4$ result in $U_0 \overset{+}{\succ} U_3$, $U_1 \overset{+}{\succ} U_3$, $U_2 \overset{+}{\succ} U_3$, $U_0 \overset{+}{\succ} U_4$, $U_1 \overset{+}{\succ} U_4$, and $U_2 \overset{+}{\succ} U_4$.
  • Figure 2: Diagram of EVC utilizing the QCT form. In this example, we assume that $P_0=Z_0\otimes Y_1\otimes X_3\otimes Y_4\otimes Z_7$, $P_1=X_2\otimes Z_6\otimes X_7$, and $P_2=X_0\otimes Y_3\otimes Z_4\otimes X_6$. There is no data dependency among $P_0$, $P_1$, and $P_2$. Note that this diagram is not a quantum circuit; it illustrates the logical flow of EVC.
  • Figure 3: Example of state vector distribution in the case of a 4-qubit system on 4 PEs. A state vector for a 4-qubit system encompasses $2^4$ probability amplitudes, ranging from $c_{0000}$ to $c_{1111}$. PE $j$ deals with a sub state-vector $\ket{\psi_j}$ consisting of $2^4/4$ elements in p-SVQCS. For example, PE 0 and PE 1 handle elements from $c_{0000}$ to $c_{0011}$ and those from $c_{0100}$ to $c_{0111}$, respectively.
  • Figure 4: Qubit mapping scheme. This figure shows the binary index $(b_0, b_1, \ldots, b_7 \in \{0,1\})$ of the probability amplitude $c$ in an $n$-qubit system simulated on $n_\mathrm{node}$ nodes, each equipped with $n_\mathrm{gpn}$ GPUs. In this example, we set $n=8$, $n_\mathrm{node}=4$, and $n_\mathrm{gpn}=8$. Consequently, the total number of GPUs is $p=n_\mathrm{node}\times n_\mathrm{gpn}=32$. The dashed lines denote the qubit set divisions. The segmented sections of the binary index exhibit regularity based on the state-vector distribution across a two-layered interconnection. Within the sub state-vector on a GPU, the bits corresponding to local qubits take every value from all zeros to all ones, while those corresponding to global qubits remain identical. Similarly, the bits corresponding to inter qubits remain identical across the sub state-vectors on a node.
  • Figure 5:
  • ...and 27 more figures
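
The state-vector distribution described in the captions of Figures 3 and 4 can be illustrated with a small index calculation. The following Python sketch is not taken from the paper; the function names `owner_pe` and `split_index` are hypothetical. It also assumes that the most significant bits of the amplitude index select the node, the next bits select the GPU within that node, and the remaining low bits address the element inside the GPU's sub state-vector. This bit order is consistent with the Figure 3 example (PE 0 holds $c_{0000}$ to $c_{0011}$), but the captions alone do not confirm it for Figure 4.

```python
def owner_pe(index: int, n_qubits: int, n_pes: int) -> int:
    """Return the PE holding amplitude c_index when the 2^n amplitudes
    are split into n_pes contiguous, equally sized blocks (Figure 3)."""
    block_size = (1 << n_qubits) // n_pes  # 2^n / p elements per PE
    return index // block_size


def split_index(index: int, n_qubits: int, n_node: int, n_gpn: int):
    """Split an amplitude index into (node, gpu, local offset).

    Assumption (not confirmed by the source): the top bits pick the node,
    the next bits pick the GPU on that node, and the low bits are local.
    n_node and n_gpn are assumed to be powers of two.
    """
    node_bits = n_node.bit_length() - 1   # log2(n_node)
    gpu_bits = n_gpn.bit_length() - 1     # log2(n_gpn)
    local_bits = n_qubits - node_bits - gpu_bits

    local = index & ((1 << local_bits) - 1)     # position inside the GPU
    gpu = (index >> local_bits) & (n_gpn - 1)   # GPU within the node
    node = index >> (local_bits + gpu_bits)     # node in the cluster
    return node, gpu, local


if __name__ == "__main__":
    # Figure 3 setting: 4 qubits on 4 PEs.
    print(owner_pe(0b0011, 4, 4), owner_pe(0b0100, 4, 4))
    # Figure 4 setting: n = 8, n_node = 4, n_gpn = 8.
    print(split_index(0b10011101, 8, 4, 8))
```

Under these assumptions, a gate acting only on qubits mapped to the low (local) bits touches amplitudes that all reside on one GPU, whereas a gate acting on a qubit mapped to the node or GPU bits pairs amplitudes held by different GPUs. That second case is what makes qubit reordering, and its all-to-all communication, necessary.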