Parallelizing Program Execution on Distributed Quantum Systems via Compiler/Hardware Co-Design
Folkert de Ronde, Alexander Knapen, Stephan Wong, Sebastian Feld
TL;DR
This work addresses the bottleneck of sequential execution in distributed quantum computers by proposing a hardware architecture and a compiler that together exploit instruction-level parallelism. The proposed two-level hierarchical network provides flexible, low-latency addressing and supports parallel and pipelined execution across node controllers. The compiler complements the hardware by scheduling instructions, decomposing them, and marking them in a subnet-aware manner to maximize parallelism, significantly reducing runtime. Evaluation results for NV-center-based systems show compiler speedups up to $13.55\times$, hardware speedups up to $6.0\times$, and combined speedups reaching $56.2\times$ for some benchmarks, with performance strongly dependent on the algorithm’s parallelism. This co-design approach advances practical, high-performance quantum computing by balancing throughput and latency through adaptable architecture and compiler strategies.
Abstract
As quantum computers continue to improve and support larger, more complex computations, smart control hardware and compilers are needed to efficiently leverage the capabilities of these systems. This paper introduces a novel approach to enhance the execution of quantum algorithms on distributed quantum systems. The proposed method combines a hardware design that supports parallel instruction execution with a compiler that modifies the order of instructions to increase opportunities for parallelism. The hardware design can be flexibly configured to facilitate parallel execution of instructions that have identical parameters. Furthermore, the compiler uses the underlying hardware constraints to intelligently reorder and decompose instructions to avoid dependencies. The compiler, the hardware, and their combination are evaluated using a runtime calculator and a set of benchmark quantum algorithms. The results demonstrate a significant speedup, with a maximum average speedup of $16.5\times$ and a maximum single-benchmark speedup of $56.2\times$ relative to a serial baseline execution model. Moreover, we show that a speedup can be obtained across all benchmarks using any of the proposed hardware schemes, although the degree of speedup depends largely on the type of quantum algorithm. Taken together, the results of this paper represent a significant step towards realizing high-performance quantum computing systems.
