Universal Quantum Simulation of 50 Qubits on Europe's First Exascale Supercomputer Harnessing Its Heterogeneous CPU-GPU Architecture

Hans De Raedt, Jiri Kraus, Andreas Herten, Vrinda Mehta, Mathis Bode, Markus Hrywniak, Kristel Michielsen, Thomas Lippert

TL;DR

This work demonstrates the first universal 50-qubit quantum computer simulation, run on Europe’s exascale-class JUPITER system using GH200 CPU–GPU superchips. It combines memory oversubscription across GH200s, adaptive 2-byte state-vector encoding, and on-the-fly network-traffic optimization to limit data movement, enabling near-linear scaling of elapsed time with the number of qubits. The key contributions are the intra- and inter-GH200 communication strategies and the adaptive byte encoding, which together allow 50-qubit simulations with FP64 accuracy for selected circuits and surpass the previous record through greater memory efficiency and reduced network load. The results highlight the practical potential of JUQCS-50 for benchmarking universal quantum circuits and for guiding future exascale quantum-simulation capabilities, with implications for VQE, QAOA, and quantum algorithm research.

Abstract

We have developed a new version of the high-performance Jülich universal quantum computer simulator (JUQCS-50) that leverages key features of the GH200 superchips as used in the JUPITER supercomputer, enabling simulations of a 50-qubit universal quantum computer for the first time. JUQCS-50 achieves this through three key innovations: (1) extending usable memory beyond GPU limits via high-bandwidth CPU-GPU interconnects and LPDDR5 memory; (2) adaptive data encoding to reduce memory footprint with acceptable trade-offs in precision and compute effort; and (3) an on-the-fly network traffic optimizer. These advances result in an 11.4-fold speedup over the previous 48-qubit record on the K computer.
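To see why memory is the limiting resource, it helps to work out the size of a 50-qubit state vector. The following is a minimal sketch, not taken from the paper: it assumes a full state vector of 2^N complex amplitudes, with 16 bytes per amplitude in FP64, 8 bytes in FP32, and 2 bytes per amplitude for the adaptive encoding mentioned in the TL;DR (the per-amplitude size of the encoding is an assumption here). The function name `state_vector_bytes` is hypothetical.

```python
# Back-of-the-envelope memory footprint of an N-qubit state vector.
# Assumption: the state is stored as 2**N complex amplitudes.

PIB = 1 << 50  # one pebibyte in bytes


def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int) -> int:
    """Return the bytes needed to store 2**num_qubits amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    # Assumed storage sizes per complex amplitude (illustrative only).
    encodings = {
        "FP64 (2 x 8 bytes)": 16,
        "FP32 (2 x 4 bytes)": 8,
        "2-byte encoding (assumed 2 bytes per amplitude)": 2,
    }
    for label, size in encodings.items():
        total = state_vector_bytes(50, size)
        print(f"{label}: {total // PIB} PiB")
```

Under these assumptions, each additional qubit doubles the footprint, and a 50-qubit state vector needs 16 PiB in FP64 but only 2 PiB with a 2-byte encoding, which is why reducing the bytes per amplitude and extending GPU memory with CPU-attached LPDDR5 both matter.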

Paper Structure

This paper contains 22 sections, 1 equation, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of JUPITER's quad GH200 node design with included technology and bandwidths. Each node contains four GH200 superchips, each comprising a tightly integrated CPU–GPU pair; see \ref{sec:innovations}.
  • Figure 2: Core of the Unified Memory Data Usage Hints implementation.
  • Figure 3: Graphical representation of the benchmark circuit \ref{sequence} for the case of $N=32$ qubits. Operations proceed from left to right. Each application of a Hadamard gate ($H$) changes all the elements of the state vector. The rightmost symbol represents the simultaneous measurement of all three components of the Pauli-spin matrices representing a qubit, as performed by JUQCS-50. The initial state vector has all qubits in state zero.
  • Figure 4: Total and compute elapsed times per gate operation for the range of qubits simulated on JUPITER (weak scaling). Lines are guides to the eye only.
  • Figure 5: Total elapsed and compute times per gate operation for the range of qubits simulated on JUPITER in FP32 mode without using the LPDDR5 memory as an extension (weak scaling). When comparing to \ref{fig:gate-op}, note the difference in scale of the $y$-axis and keep in mind that the number of GPUs used is eight times larger; see \ref{tabFP32}. Lines are guides to the eye only.
  • ...and 4 more figures