SW-TNC : Reaching the Most Complex Random Quantum Circuit via Tensor Network Contraction

Yaojian Chen, Zhaoqi Sun, Chengyu Qiu, Zegang Li, Yanfei Liu, Lin Gan, Xiaohui Duan, Guangwen Yang

TL;DR

This work advances classical simulation of large random quantum circuits by optimizing tensor-network contraction on the Sunway architecture. It introduces data-reuse strategies (tree-like and spindle-like), core-array fusion with RMA, and in-kernel vectorized permutation, along with split-common TTGT to handle diverse contraction patterns. The combination yields substantial complexity reductions and performance gains, demonstrated by over 10× speedups on Zuchongzhi-60-24 across 1024+ Sunway nodes, and strong scalability up to thousands of processes. These techniques not only push the practical limits of classical RQC simulation but also offer broadly transferable insights for high-performance tensor computations and quantum-device verification.

Abstract

Classical simulation is essential in quantum algorithm development and quantum device verification. With the increasing complexity and diversity of quantum circuit structures, existing classical simulation algorithms need to be improved and extended. In this work, we propose novel strategies for a tensor-network-contraction-based simulator on the Sunway architecture. Our approach addresses three main aspects: complexity, computational paradigms, and fine-grained optimization. Data reuse schemes are designed to reduce floating-point operations, and memory organization techniques are employed to eliminate slicing overhead while maintaining parallelism. The step fusion strategy is extended with multi-core cooperation to improve data locality and computational intensity. Fine-grained optimizations, such as in-kernel vectorized permutations and split-K operators, are also developed to address the challenges posed by the new hotspot distribution and topological structure. These innovations accelerate the simulation of Zuchongzhi-60-24 by more than 10 times, using more than 1024 Sunway nodes (399,360 cores). Our work demonstrates the potential for efficient classical simulation of increasingly complex quantum circuits.
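To make the notion of slicing overhead concrete, the following is a minimal toy sketch in Python with NumPy. It is not the paper's tree-like or spindle-like scheme; the four-tensor ring network, the tensor shapes, and the function names (`contract_full`, `contract_sliced_naive`, `contract_sliced_reuse`) are all assumptions chosen for illustration. Slicing the index `j` splits the contraction into independent subtasks, but any step that does not involve `j` is recomputed in every subtask unless its result is reused.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ring network: A[i,j] B[j,k] C[k,l] D[l,i] -> scalar (shapes are illustrative).
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 5))
C = rng.standard_normal((5, 6))
D = rng.standard_normal((6, 4))


def contract_full(A, B, C, D):
    """Reference result: contract the whole network at once."""
    return np.einsum("ij,jk,kl,li->", A, B, C, D)


def contract_sliced_naive(A, B, C, D):
    """Slice index j; every subtask redoes the j-independent step C @ D."""
    total = 0.0
    for j in range(A.shape[1]):
        CD = C @ D  # shape (k, i); identical in every slice, so this is overhead
        total += B[j, :] @ CD @ A[:, j]
    return total


def contract_sliced_reuse(A, B, C, D):
    """Slice index j, but compute the j-independent intermediate only once."""
    CD = C @ D  # shared by all subtasks
    total = 0.0
    for j in range(A.shape[1]):
        total += B[j, :] @ CD @ A[:, j]
    return total


full = contract_full(A, B, C, D)
naive = contract_sliced_naive(A, B, C, D)
reuse = contract_sliced_reuse(A, B, C, D)
```

All three functions return the same value up to floating-point rounding; the reuse variant performs the `C @ D` product once instead of once per slice, which is the kind of saving a data-reuse scheme targets.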

Paper Structure

This paper contains 24 sections, 4 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Distribution of slicing overhead for different circuits. Sycamore-53 [google-nature-2019], Zuchongzhi-56 [zuchongzhi], and Zuchongzhi-60 [zuchongzhi2] are chosen for testing. Circuits are labeled as name + qubits + cycles (s53_12 denotes Sycamore-53-12) and sorted by their minimum FLOPs. The memory limit is set to rank-31. For each circuit, we searched 300 paths.
  • Figure 2: Visualized structure of contraction trees. Each node represents a contraction step, and darker nodes indicate higher complexity. For each circuit, 300 contraction trees are searched and one typical case is shown. a) Sycamore-53-20 circuit [google-nature-2019]. b) Zuchongzhi-60-24 circuit [zuchongzhi2].
  • Figure 3: Tree-like data reuse. In both sub-figures, the sequence number represents the order of execution, and the thick solid lines denote continuous contraction steps. In a), slicing happens when the lifetime of an index starts, and every subtask is split in two. Intermediate tensors are stored when a thick line ends, and deleted when both of its children start. In b), merging happens at the end of the lifetime, and the two subtasks along the corresponding dimension are reduced. When both subtasks paired on an index end, the intermediate tensor stored by the first one can be deleted.
  • Figure 4: Spindle-like reuse, which is the combination of pre-lifetime and post-lifetime reuse. The sequence number represents the order of execution, and the thick solid lines denote continuous contraction steps. Intermediate tensors are stored and deleted following the rules of both pre-lifetime and post-lifetime reuse.
  • Figure 5: RMA communication scheme to lengthen the fused section. Without loss of generality, the cooperation of 4 CPEs is shown. A rank-5 tensor is distributed across 4 CPEs, with 3 indices intra-LDM and 2 indices inter-LDM. When the inter-LDM indices need to be contracted, RMA is used to swap indices. After communication, an inter-LDM index and an intra-LDM index have exchanged positions.
  • ...and 9 more figures
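
The index exchange described in the Figure 5 caption can be emulated on a single machine as a rough sketch. Everything here is an assumption made for illustration: a Python dict of NumPy arrays stands in for the per-CPE LDM, a plain array assignment stands in for the RMA transfer, every index has dimension 2, and the names `distribute` and `swap_inter_intra` are hypothetical. No Sunway API is used. The rank-5 tensor is indexed as `T[p, q, a, b, c]`, where `(p, q)` selects one of the 4 CPEs (inter-LDM) and `(a, b, c)` are local (intra-LDM).

```python
import numpy as np


def distribute(T):
    """Split T[p,q,a,b,c] into 4 local blocks keyed by the inter-LDM pair (p, q)."""
    return {(p, q): T[p, q].copy() for p in range(2) for q in range(2)}


def swap_inter_intra(blocks):
    """Exchange inter-LDM index q with intra-LDM index a.

    Afterwards the block held by CPE (p, a) is indexed [q, b, c].
    Each assignment below models one transfer of a local slice.
    """
    new_blocks = {key: np.empty_like(blk) for key, blk in blocks.items()}
    for (p, q), blk in blocks.items():
        for a in range(2):
            # Slice a of CPE (p, q) lands on CPE (p, a) at local position q.
            new_blocks[(p, a)][q] = blk[a]
    return new_blocks


T = np.arange(32, dtype=np.float64).reshape(2, 2, 2, 2, 2)
swapped = swap_inter_intra(distribute(T))

# Gather the blocks back into one array indexed [p, a, q, b, c] for checking.
gathered = np.array([[swapped[(p, a)] for a in range(2)] for p in range(2)])
```

In this emulation each CPE keeps the slice whose local index already matches its own position (`a == q`) and sends the other slice to its partner, so the gathered result equals `np.swapaxes(T, 1, 2)`. Once `q` is local, it can be contracted without further communication.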