
Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation

Rong Fu, Zhongling Su, Han-Sen Zhong, Xiti Zhao, Jianyang Zhang, Feng Pan, Pan Zhang, Xianhe Zhao, Ming-Cheng Chen, Chao-Yang Lu, Jian-Wei Pan, Zhiling Pei, Xingcheng Zhang, Wanli Ouyang

TL;DR

This work targets the scalability and energy-efficiency gap in simulating random quantum circuits (RQCs) on classical hardware. It introduces a three-level parallelization scheme, hybrid low-precision inter-node communication, and an extended complex-half Einsum framework, which together allow large tensor networks (up to tens of terabytes, across up to 2,304 GPUs) to be contracted efficiently. The authors report a time-to-solution more than an order of magnitude shorter than that of Google's Sycamore (600 seconds), along with substantial energy reductions, including a best case of 17.18 seconds at 0.29 kWh with XEB 0.002 for a 32 TB tensor network with post-processing. The results challenge the notion that quantum hardware inherently outperforms all classical approaches for RQC sampling in these regimes and point toward broader, scalable applications in quantum simulation and beyond.

Abstract

Quantum Computational Superiority boasts rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's Sycamore, challenges remain in generating uncorrelated samples of random quantum circuits. In this paper, we present a groundbreaking large-scale system technology that leverages optimization at the global, node, and device levels to achieve unprecedented scalability for tensor networks. Our techniques accommodate tensor networks that require up to tens of terabytes of memory, surpassing the memory constraints of a single node, and scale to 2304 GPUs with a peak computing power of 561 PFLOPS at half precision. Notably, we achieve a time-to-solution of 14.22 seconds with an energy consumption of 2.39 kWh at a fidelity of 0.002. Our most remarkable result is a time-to-solution of 17.18 seconds with an energy consumption of only 0.29 kWh at an XEB of 0.002 after post-processing. Both results outperform Google's quantum processor Sycamore, which recorded 600 seconds and 4.3 kWh, in speed and energy efficiency.
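The comparison with Sycamore can be made concrete with a little arithmetic. The sketch below only divides the figures quoted in the abstract; the dictionary layout and the function name `improvement_factors` are illustrative choices, not part of the paper.

```python
# Figures quoted in the abstract (time in seconds, energy in kWh).
SYCAMORE = {"time_s": 600.0, "energy_kwh": 4.3}
CLASSICAL_RUNS = {
    "fidelity 0.002": {"time_s": 14.22, "energy_kwh": 2.39},
    "XEB 0.002 after post-processing": {"time_s": 17.18, "energy_kwh": 0.29},
}


def improvement_factors(baseline, run):
    """Return (speedup, energy_reduction) of `run` relative to `baseline`."""
    speedup = baseline["time_s"] / run["time_s"]
    energy_reduction = baseline["energy_kwh"] / run["energy_kwh"]
    return speedup, energy_reduction


for label, run in CLASSICAL_RUNS.items():
    speedup, energy_reduction = improvement_factors(SYCAMORE, run)
    print(f"{label}: {speedup:.1f}x faster, {energy_reduction:.1f}x less energy")
```

Under these numbers the 14.22-second run is roughly 42 times faster than Sycamore while using about 1.8 times less energy, and the 17.18-second run is roughly 35 times faster while using about 15 times less energy.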

Paper Structure

This paper contains 25 sections, 11 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance of implementations of sampling the Sycamore circuit. The horizontal axis denotes the time-to-solution, and the vertical axis denotes the energy consumption of the quantum experiment or classical simulation. Circles and squares correspond to classical simulations and quantum experiments, respectively. The hollow circle indicates a correlated-sampling loophole in the corresponding classical simulation. The region shaded in misty rose marks superior results in terms of both time and energy consumption.
  • Figure 2: The relationship between space complexity and time complexity. (a) shows the minimal time complexity of contraction paths under a given memory limit; the red and green hollow pentagrams mark the chosen optimal solutions at 4 TB and 32 TB, respectively. (b) shows the time-complexity distributions of multiple contraction paths found by simulated annealing under memory limits ranging from 64 GB to 2 PB. For each memory limit, the minimum time complexity in (b) is taken as its optimal contraction path and corresponds to the point of the same color in (a).
  • Figure 3: Example quantum circuit instance.
  • Figure 4: Architectural Overview and Example of Parallel Scheme. (a) Overview of the three-level parallel scheme: the task commences at the global level, then the tensor network is partitioned into parallel, independent sub-networks. Data is subsequently segmented across nodes within a multi-node level, interconnected through InfiniBand. Finally, the data is divided into sizes compatible with individual devices within each node, connected via high-bandwidth NVLink. (b) Example of 2-Node-4-Device Communication: we exhibit a subtask that encompasses two nodes, each hosting two devices, and demonstrate the data permutation occurring across both inter-node and intra-node levels.
  • Figure 5: The bottom part shows tensor multiplication in which tensors are retrieved through indices in the traditional way. The top part shows multiplication between a source tensor and a padding tensor, which is retrieved through a 2-dimensionalizing index. $A_{I}$ and $B_{I}$ are the input tensors of the multiplication, indexed from the source tensors A and B. $B_{P}$ is a padding tensor. C and $C_{P}$ are the respective products of the two multiplications.
  • ...and 3 more figures
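
The selection rule described in the Figure 2 caption (for each memory limit, keep the contraction path with the lowest time complexity) can be sketched as follows. This is a minimal illustration, not the authors' search code: the function name `best_path_per_memory_limit` is hypothetical, and the sample values are made-up placeholders rather than data from the paper.

```python
def best_path_per_memory_limit(paths):
    """Map each memory limit to the lowest time complexity found under it.

    `paths` is an iterable of (memory_limit_gb, time_complexity) pairs, one
    per candidate contraction path (for example, one per annealing run).
    """
    best = {}
    for memory_limit_gb, time_complexity in paths:
        current = best.get(memory_limit_gb)
        if current is None or time_complexity < current:
            best[memory_limit_gb] = time_complexity
    # Sort by memory limit so the result reads like the curve in panel (a).
    return dict(sorted(best.items()))


# Placeholder values only: (memory limit in GB, log10 of time complexity).
candidate_paths = [
    (64, 19.4), (64, 19.1), (64, 19.8),
    (4096, 18.2), (4096, 17.9),
    (32768, 17.5), (32768, 17.3), (32768, 17.6),
]

for limit_gb, log_time in best_path_per_memory_limit(candidate_paths).items():
    print(f"{limit_gb} GB -> best log10(time complexity) = {log_time}")
```

Each memory limit contributes one point, so the output corresponds to the lower envelope that panel (a) draws from the distributions in panel (b).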