Realization of Thread Level Parallelism on Quantum Devices

Keren Li, Zidong Lin, Zheng An, Guanru Feng, Zipeng Wu, Shiyao Hou, Jingen Xiang

TL;DR

This work introduces a classical linkage scheme that merges multiple independent quantum processing units (QPUs) into a single logical device, enabling thread-level parallelism (TLP), and shows that quantum routines with product-state inputs and low-rank entangling layers can be re-expressed in an efficient parallelizable form.

Abstract

Scaling up quantum devices is a central challenge for realizing practical quantum computation. Modular quantum architectures promise scalability, yet experiments to date have relied on either $\sim\!10^{3}$-qubit monolithic chips or fragile interconnects with high loss. Here, we introduce a classical linkage scheme that merges multiple independent quantum processing units (QPUs) into a single logical device, enabling thread-level parallelism (TLP). Theoretically, we show that quantum routines with product-state inputs and low-rank entangling layers can be re-expressed in an efficient parallelizable form. Experimentally, we validate this architecture on clusters comprising up to sixteen benchtop nuclear magnetic resonance (NMR) quantum nodes. A four-qubit Greenberger-Horne-Zeilinger (GHZ) state is partitioned into parallel two-qubit subcircuits, achieving a fidelity of $93.8\,\%$ with respect to the ideal state. A non-Hermitian evolution, implemented via a truncated Cauchy integral on Hermitian Hamiltonians, reproduces exact observables with high accuracy. Our results demonstrate that classical links suffice to scale up the logical size of quantum computations and realize general, non-unitary channels on today's hardware, opening an experimentally accessible route toward software-defined, clustered quantum accelerators.
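
The GHZ partition described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes one possible gate ordering (H on qubit 1, CNOT 1→2, then CNOT 2→3 written as H·CZ·H on qubit 3, then CNOT 3→4) and uses the standard identity $\mathrm{CZ} = \tfrac{1}{2}(I\otimes I + Z\otimes I + I\otimes Z - Z\otimes Z)$ to replace the gate across the cut by a sum of local operations. The helper names `left_state` and `right_state` are hypothetical.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
CZ = np.diag([1.0, 1.0, 1.0, -1.0])

# CZ written as a linear combination of products of single-qubit operators.
terms = [(0.5, I2, I2), (0.5, Z, I2), (0.5, I2, Z), (-0.5, Z, Z)]
cz_sum = sum(c * np.kron(a, b) for c, a, b in terms)

zero2 = np.array([1.0, 0.0, 0.0, 0.0])  # |00>

def left_state(P):
    """Qubits 1,2: H on qubit 1, CNOT(1->2), then local term P on qubit 2."""
    return np.kron(I2, P) @ CNOT @ np.kron(H, I2) @ zero2

def right_state(P):
    """Qubits 3,4: H, local term P, H on qubit 3, then CNOT(3->4)."""
    return CNOT @ np.kron(H @ P @ H, I2) @ zero2

# Each term is a product of two independent two-qubit states.
phis = [left_state(a) for _, a, _ in terms]
chis = [right_state(b) for _, _, b in terms]
coeffs = [c for c, _, _ in terms]

# Classical recombination of the four product terms gives the 4-qubit state.
state = sum(c * np.kron(p, x) for c, p, x in zip(coeffs, phis, chis))
ghz = np.zeros(16)
ghz[0] = ghz[15] = 1 / np.sqrt(2)

# A product observable O_A (x) O_B is recovered from two-qubit quantities only.
O_A = np.kron(X, X)
O_B = np.kron(X, X)
factorized = sum(
    cj * ck * (phis[k] @ O_A @ phis[j]) * (chis[k] @ O_B @ chis[j])
    for j, cj in enumerate(coeffs)
    for k, ck in enumerate(coeffs)
)
direct = ghz @ np.kron(O_A, O_B) @ ghz
```

In this sketch the cross terms with $j \neq k$ are computed directly from state vectors; on hardware they would have to be estimated by measurement, which is not modeled here.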

Paper Structure

This paper contains 13 sections, 2 theorems, 39 equations, 13 figures, 9 tables, 1 algorithm.

Key Result

Lemma C.1

Consider an $n$-qubit circuit of depth $m_d$, where at each layer $i$, $G^{(i)}_{e}$ is a set of two-qubit gates that may couple any pair of qubits, and $U^{(i)}_{a}$ acts locally on qubit $a$.

Figures (13)

  • Figure 1: (a) Instruction-, data-, and thread-level parallelism in a classical processor. (b) Factorized evaluation on modular quantum hardware: a trajectory on a generalized Bloch sphere is decomposed into blocks that factorize across subsystems. Local trajectories (on the generalized Bloch sphere of each subsystem) are measured independently and then classically aggregated to recover the global result.
  • Figure 2: Hardware overview. (a) Clustered-QPU architecture enabling TLP in quantum computing. (b) Physical sample assembly employed in this work.
  • Figure 3: (a) Experimental quantum circuit for preparing a four-qubit GHZ state. The CZ gate between qubits 2 and 3 has been "cut" and implemented as a linear combination of local operations. (b) and (c) depict the real and imaginary parts of the reconstructed density matrices for our method and the circuit-cut state, with the target values shown as transparent blocks.
  • Figure 4: (a) Sketch of the basic idea of linear combination of Hamiltonian simulations. (b) Time evolution of $\langle\sigma_{y,z}\rangle$ and a randomly generated Hermitian observable under a non-Hermitian Hamiltonian. (c) Measured $\langle H(\gamma)\rangle$ and $\langle\sigma_{x,z}\rangle$ after imaginary-time evolution of $H(\gamma)$. Solid lines with filled points indicate experimental data; hollow points with solid lines represent simulation results obtained via the experimental method; dashed lines represent results from first-principles calculations. In (c), dash-dotted lines represent results from exact diagonalization and longer imaginary-time evolution. (d) and (e) show the fidelity of the experimentally prepared states relative to the numerical predictions, with the median fidelity and the 90% threshold highlighted.
  • Figure S1: Single-ancilla estimator: an ancillary qubit is used to estimate overlaps.
  • ...and 8 more figures
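
The idea sketched in Figure 4(a), a linear combination of Hamiltonian simulations, can be checked numerically on a small example. The block below is a sketch under stated assumptions rather than the paper's procedure: it uses the known identity $e^{-At} = \int_{-\infty}^{\infty} \frac{1}{\pi(1+k^{2})}\, e^{-i(kL+H)t}\, dk$ for $A = L + iH$ with $L \succeq 0$ and $H$ Hermitian, truncates the integral to $[-K, K]$, and applies a plain trapezoid rule. The matrices `L` and `H`, the cutoff `K`, and the step `dk` are illustrative choices; the kernel and truncation used in the experiment are not given in this summary.

```python
import numpy as np

# Hypothetical 2x2 example: A = L + iH with L positive definite, H Hermitian.
L = np.array([[1.0, 0.3], [0.3, 0.5]], dtype=complex)
H = np.array([[0.2, 0.7], [0.7, -0.2]], dtype=complex)
A = L + 1j * H
t = 1.0

# Reference: exp(-A t) from the eigendecomposition of the non-Hermitian A.
evals, vecs = np.linalg.eig(A)
exact = vecs @ np.diag(np.exp(-evals * t)) @ np.linalg.inv(vecs)

# Truncated integration grid with trapezoid weights (assumed cutoff and step).
K, dk = 200.0, 0.01
k = np.arange(-K, K + dk / 2, dk)
weights = np.full(k.shape, dk)
weights[[0, -1]] = dk / 2
kernel = 1.0 / (np.pi * (1.0 + k**2))  # Cauchy (Lorentzian) weight

# Each grid point needs only a unitary evolution under the Hermitian k*L + H.
w, V = np.linalg.eigh(k[:, None, None] * L + H)
U = (V * np.exp(-1j * w * t)[:, None, :]) @ V.conj().transpose(0, 2, 1)

# Weighted sum of unitaries approximates the non-unitary propagator.
approx = np.einsum("k,kij->ij", weights * kernel, U)
error = np.linalg.norm(approx - exact, ord=2)
```

Because the weight decays only as $1/k^{2}$, the truncation error in this sketch is bounded by roughly $2/(\pi K)$, so the cutoff controls how many unitary evolutions are needed for a given accuracy.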

Theorems & Definitions (2)

  • Lemma C.1: Layer-wise decomposition
  • Lemma C.2: Single-ancilla estimator
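
Since the statement of Lemma C.2 is not reproduced here, the following is only a generic illustration of how a single ancillary qubit can estimate an overlap: the standard Hadamard test, for which $P(0) - P(1) = \mathrm{Re}\,\langle\psi|U|\psi\rangle$, with an extra $S^{\dagger}$ on the ancilla giving the imaginary part. Whether the paper's estimator takes exactly this form is an assumption, and the function name `hadamard_test` is hypothetical.

```python
import numpy as np

def hadamard_test(psi, U, imaginary=False):
    """Return Re<psi|U|psi> (or Im if imaginary=True) from ancilla statistics."""
    d = len(psi)
    Hd = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    Sdg = np.diag([1, -1j])
    eye = np.eye(d, dtype=complex)

    # The ancilla is the first tensor factor and starts in |0>.
    state = np.kron(np.array([1, 0], dtype=complex), psi.astype(complex))
    state = np.kron(Hd, eye) @ state
    if imaginary:
        state = np.kron(Sdg, eye) @ state

    # Controlled-U: apply U to the system only when the ancilla is |1>.
    zeros = np.zeros((d, d), dtype=complex)
    cU = np.block([[eye, zeros], [zeros, U.astype(complex)]])
    state = np.kron(Hd, eye) @ (cU @ state)

    # P(0) - P(1) = 2 P(0) - 1 for the ancilla measurement.
    p0 = np.sum(np.abs(state[:d]) ** 2)
    return 2 * p0 - 1

# Example with a single-qubit state and a phase gate (illustrative values).
psi = np.array([1.0, 1.0]) / np.sqrt(2)
theta = 0.4
U = np.diag([1.0, np.exp(1j * theta)])
re_part = hadamard_test(psi, U)
im_part = hadamard_test(psi, U, imaginary=True)
overlap = np.vdot(psi, U @ psi)
```

The sketch works with exact probabilities; in an experiment $P(0)$ would be estimated from repeated measurements of the ancilla.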