Circuit Partitioning and Full Circuit Execution: A Comparative Study of GPU-Based Quantum Circuit Simulation

Kartikey Sarode, Daniel E. Huang, E. Wes Bethel

TL;DR

The paper addresses the challenge of simulating quantum circuits too large for current NISQ devices by comparing circuit-splitting via CutQC against full-circuit execution with distributed memory on GPUs. It combines CutQC’s circuit-cutting framework with GPU-accelerated statevector simulation (Qiskit Aer-GPU) to reconstruct the original circuit's output probabilities, and contrasts this with distributed full-circuit statevector simulation. The main finding is that full-circuit execution is faster on a single node. Circuit-splitting incurs exponential post-processing costs ($4^{K}$ for $K$ wire cuts and $9^{n}$ for $n$ cut CNOT gates) but narrows subcircuits by about 30–40% in width, which reduces memory requirements and makes it potentially advantageous under resource constraints. The work clarifies the runtime-memory tradeoffs between the two approaches and suggests hybrid strategies for large-scale quantum circuit simulation, with implications for scalable algorithm validation on classical hardware.
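
To make the scaling quoted in this summary concrete, here is a minimal Python sketch. It is not the paper's code. It assumes the post-processing cost can be approximated by the number of reconstruction terms ($4^{K}$ for $K$ wire cuts, $9^{n}$ for $n$ cut CNOT gates) and that the widest subcircuit is narrower than the original by a given percentage; all function names are hypothetical.

```python
def wire_cut_terms(num_cuts: int) -> int:
    """Reconstruction terms assumed for `num_cuts` wire cuts (4^K)."""
    return 4 ** num_cuts


def gate_cut_terms(num_cut_cnots: int) -> int:
    """Reconstruction terms assumed for `num_cut_cnots` cut CNOT gates (9^n)."""
    return 9 ** num_cut_cnots


def narrowed_width(full_width: int, reduction_percent: int) -> int:
    """Width of the widest subcircuit after a percentage width reduction.

    Integer arithmetic is used so that the result never depends on
    floating-point rounding; the reduction itself is rounded down.
    """
    return full_width - (full_width * reduction_percent) // 100


# Illustrative numbers only: a 30-qubit circuit narrowed by 30% or 40%.
example_widths = [narrowed_width(30, p) for p in (30, 40)]
example_terms = [wire_cut_terms(k) for k in (1, 2, 3)]
```

Under these assumptions, each additional wire cut multiplies the number of terms by four, while each qubit removed from the widest subcircuit halves the statevector that has to be held in memory. That is the runtime-memory tradeoff the summary describes.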

Abstract

Executing large quantum circuits is not feasible using the currently available NISQ (noisy intermediate-scale quantum) devices. The high costs of using real quantum devices make it further challenging to research and develop quantum algorithms. As a result, performing classical simulations is usually the preferred method for researching and validating large-scale quantum algorithms. However, these simulations require a huge amount of resources, as each additional qubit exponentially increases the computational space required. Distributed Quantum Computing (DQC) is a promising alternative to reduce the resources required for simulating large quantum algorithms at the cost of increased runtime. This study presents a comparative analysis of two simulation methods: circuit-splitting and full-circuit execution using distributed memory, each having a different type of overhead. The first method, using CutQC, cuts the circuit into smaller subcircuits and allows us to simulate a large quantum circuit on smaller machines. The second method, using Qiskit-Aer-GPU, distributes the computational space across a distributed memory system to simulate the entire quantum circuit. Results indicate that full-circuit executions are faster than circuit-splitting for simulations performed on a single node. However, circuit-splitting simulations show promising results in specific scenarios as the number of qubits is scaled.
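
The exponential growth in resources mentioned in the abstract can be illustrated with a short Python sketch. It assumes a dense statevector with one double-precision complex amplitude (16 bytes) per basis state, split evenly across devices, and it ignores any extra buffers needed for exchanging amplitudes. The names below are hypothetical and are not taken from Qiskit Aer.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (two 64-bit floats)


def statevector_bytes(num_qubits: int) -> int:
    """Memory for a dense statevector of `num_qubits` qubits."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


def bytes_per_device(num_qubits: int, num_devices: int) -> int:
    """Even share of the statevector per device (exchange buffers ignored)."""
    if num_devices <= 0 or (2 ** num_qubits) % num_devices != 0:
        raise ValueError("num_devices must be positive and divide 2**num_qubits")
    return statevector_bytes(num_qubits) // num_devices


GIB = 2 ** 30
# Each additional qubit doubles the footprint.
footprints_gib = {n: statevector_bytes(n) // GIB for n in (30, 31, 32)}
```

With these assumptions a 30-qubit statevector occupies 16 GiB, and distributing a 32-qubit statevector over four devices brings the per-device share back down to 16 GiB.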

Paper Structure

This paper contains 21 sections, 15 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Procedure to cut one qubit wire. The wire between vertices $u$ and $v$ (left) can be cut, as shown on the right, by summing over four pairs of measurement circuits appended to $u$ and state initialization circuits prepended to $v$. Measurement circuits in the I and Z basis have the same physical implementation. The three different upstream measurement circuits and the four different downstream initialization circuits are now separate and can be independently evaluated. (Image Source: Tang, W. et al.)
  • Figure 2: Example of cutting a five-qubit circuit into two smaller subcircuits of three qubits each. The subcircuits are produced by cutting the $q_2$ wire between the first two $cZ$ gates. The three variations of $subcircuit_1$ and four variations of $subcircuit_2$ can then be evaluated on a 3-qubit quantum device, instead of a 5-qubit device. The classical postprocessing involves summing over four Kronecker products between the two subcircuits for the one cut made. (Image Source: Tang, W. et al.)
  • Figure 3: Different chunking methods. (a) A naive implementation of probability amplitude exchange between distributed memory spaces. In this example, we apply a gate to qubit $k=n-1$. This implementation requires twice the memory space. (b) Dividing a state vector into small chunks and performing probability amplitude exchange by chunks. So, we only need one additional chunk per memory space to perform probability amplitude exchange. (Image Source: Doi, J. et al.)
  • Figure 4: Example of using cache blocking on a quantum circuit. (a) shows the input circuit consisting of u1, u3 and CNOT gates. nc denotes the number of qubits of a chunk. The gates on qubits $>$ nc need to refer to probability amplitudes over multiple chunks to be simulated. (b) shows the output circuit after cache blocking is performed. Four swap gates are added to move all the gates to qubits $<$ nc; now, all the gates can be performed without referring to probability amplitudes over chunks. (Image Source: Doi, J. et al.)
  • Figure 5: This figure represents how the circuits are generated and saved as QPY files to avoid re-generating circuits for every benchmark run. A Slurm job runs a Python script which executes the required circuit generator and saves the circuit to disk using the QPY module in Qiskit.
  • ...and 1 more figures
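
The wire-cutting procedure described in the Figure 1 caption rests on expanding a single-qubit state in the Pauli basis, $\rho = \tfrac{1}{2}\sum_{O \in \{I, X, Y, Z\}} \mathrm{Tr}(\rho O)\, O$, and then writing each Pauli operator in terms of the four initialization states $|0\rangle$, $|1\rangle$, $|+\rangle$ and $|i\rangle$. The following pure-Python sketch checks that identity numerically for one example state. It is an illustration with hypothetical helper names, not CutQC's implementation, and it works with exact density matrices rather than sampled measurement outcomes.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def trace(a):
    """Trace of a 2x2 matrix."""
    return a[0][0] + a[1][1]


def lincomb(terms):
    """Sum of coeff * matrix over (coeff, matrix) pairs."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(2)]
            for i in range(2)]


def projector(state):
    """Outer product |state><state| of a normalized single-qubit state."""
    return [[state[i] * complex(state[j]).conjugate() for j in range(2)]
            for i in range(2)]


S = 1 / math.sqrt(2)
P0 = projector([1, 0])            # |0><0|
P1 = projector([0, 1])            # |1><1|
PPLUS = projector([S, S])         # |+><+|
PPLUS_I = projector([S, 1j * S])  # |i><i|

PAULI_I = [[1, 0], [0, 1]]
PAULI_X = [[0, 1], [1, 0]]
PAULI_Y = [[0, -1j], [1j, 0]]
PAULI_Z = [[1, 0], [0, -1]]


def cut_and_reconstruct(rho):
    """Rebuild rho from four (upstream measurement, downstream init) pairs.

    The upstream side contributes the expectation value Tr(rho * O); the
    downstream side contributes O rewritten as a combination of the four
    initialization states. I and Z share the same measurement basis.
    """
    pairs = [
        (trace(matmul(rho, PAULI_I)), lincomb([(1, P0), (1, P1)])),
        (trace(matmul(rho, PAULI_Z)), lincomb([(1, P0), (-1, P1)])),
        (trace(matmul(rho, PAULI_X)),
         lincomb([(2, PPLUS), (-1, P0), (-1, P1)])),
        (trace(matmul(rho, PAULI_Y)),
         lincomb([(2, PPLUS_I), (-1, P0), (-1, P1)])),
    ]
    return lincomb([(0.5 * coeff, op) for coeff, op in pairs])


# Example: the state 0.6|0> + 0.8i|1> passes through the "cut" unchanged.
rho = projector([0.6, 0.8j])
rebuilt = cut_and_reconstruct(rho)
max_error = max(abs(rebuilt[i][j] - rho[i][j])
                for i in range(2) for j in range(2))
```

Because the sum has four terms per cut wire, applying the same expansion to $K$ wires produces $4^{K}$ terms. This matches the four Kronecker products mentioned in the Figure 2 caption for a single cut.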