Parallelizing Program Execution on Distributed Quantum Systems via Compiler/Hardware Co-Design

Folkert de Ronde, Alexander Knapen, Stephan Wong, Sebastian Feld

TL;DR

This work addresses the bottleneck of sequential execution in distributed quantum computers by proposing a hardware architecture and a compiler that together exploit instruction-level parallelism. The two-level hierarchical network provides flexible, low-latency addressing and supports parallel and pipelined execution across node controllers. The compiler complements the hardware by scheduling instructions, decomposing them, and marking them in a subnet-aware manner to maximize parallelism, which significantly reduces runtime. Evaluation results for NV-center-based systems show compiler speedups of up to $13.55\times$, hardware speedups of up to $6.0\times$, and combined speedups of up to $56.2\times$ for some benchmarks, with performance strongly dependent on the parallelism available in the algorithm. This co-design approach advances practical, high-performance quantum computing by balancing throughput and latency through adaptable architecture and compiler strategies.

Abstract

As quantum computers continue to improve and support larger, more complex computations, smart control hardware and compilers are needed to efficiently leverage the capabilities of these systems. This paper introduces a novel approach to enhance the execution of quantum algorithms on distributed quantum systems. The proposed method involves the development of a hardware design that supports parallel instruction execution and a compiler that modifies the order of instructions to increase parallelism opportunities. The hardware design can be flexibly configured to facilitate parallel execution of instructions that have identical parameters. Furthermore, the compiler uses the underlying hardware constraints to intelligently reorder and decompose instructions to avoid dependencies. The compiler, hardware, and their combination are evaluated using a runtime calculator and a benchmark quantum algorithm set. The results demonstrate a significant speedup, achieving a maximum average speedup of 16.5x and a maximum single-benchmark speedup of 56.2x relative to a baseline, serial execution model. Furthermore, we show a speedup can be obtained across all benchmarks using any of the proposed hardware schemes, although the degree of speedup is largely dependent on the type of quantum algorithm. Taken together, the results of this paper represent a significant step towards realizing high-performance quantum computing systems.

Paper Structure

This paper contains 38 sections, 9 equations, 15 figures, 4 tables, and 5 algorithms.

Figures (15)

  • Figure 1: The quantum computing system stack (left-hand side) with a detailed structure of the control architecture, quantum-classical interface, and quantum chip layers (right-hand side).
  • Figure 2: Node controller selection using an ID-encoded address vs. a bitmap-encoded address. (a) An ID-encoded address has a smaller width but can only target a single node controller. (b) A bitmap-encoded address has a larger width but is capable of targeting multiple node controllers simultaneously.
  • Figure 3: Mapping of a logical qubit onto a system of four physical qubits in the semi-distributed (top) or fully distributed (bottom) mode. In the semi-distributed mode, two physical qubits are placed in the same node. In the fully distributed mode, every physical qubit is placed on a different node.
  • Figure 4: A basic decomposition for the Rx, Ry, Rz and CX gates for the semi-distributed mode. The Rx and Ry gates have equivalent decompositions. The CX gate decomposition shows black dotted and orange dotted lines, representing a sequential and pipelined operation respectively.
  • Figure 5: Full physical CX gate decomposition (adapted from cx_gate). The first squiggly line represents an entangle operation. The operations presented in red and yellow represent Rx and Ry operations that can potentially be executed in parallel with Rx or Ry operations following the CX gate. The dotted lines coming from the measurement gate connect the measurement value with the controlled operation.
  • ...and 10 more figures
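
The trade-off described in the Figure 2 caption (narrow ID-encoded addresses that select one node controller versus wider bitmap-encoded addresses that can select several at once) can be illustrated with a minimal sketch. The function names and the convention that bit `i` of a bitmap selects node controller `i` are assumptions made for illustration only; they are not taken from the paper.

```python
def id_address_width(num_controllers: int) -> int:
    """Bits needed for an ID-encoded address, which names exactly one controller."""
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2; keep at least one bit.
    return max(1, (num_controllers - 1).bit_length())


def bitmap_address_width(num_controllers: int) -> int:
    """Bits needed for a bitmap-encoded address: one bit per controller."""
    return num_controllers


def decode_id(address: int, num_controllers: int) -> list[int]:
    """An ID-encoded address always targets a single controller."""
    if not 0 <= address < num_controllers:
        raise ValueError("ID address out of range")
    return [address]


def decode_bitmap(address: int, num_controllers: int) -> list[int]:
    """A bitmap-encoded address targets every controller whose bit is set.

    Assumed convention (hypothetical): bit i selects controller i.
    """
    if not 0 <= address < (1 << num_controllers):
        raise ValueError("bitmap address out of range")
    return [i for i in range(num_controllers) if (address >> i) & 1]


if __name__ == "__main__":
    n = 8  # hypothetical number of node controllers
    print(id_address_width(n), bitmap_address_width(n))
    print(decode_id(5, n))
    print(decode_bitmap(0b00100101, n))
```

With eight controllers in this sketch, the ID encoding needs 3 bits and the bitmap needs 8, but only the bitmap can issue one instruction to several controllers in the same step.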