End-to-end performance of quantum-accelerated large-scale linear algebra workflows

Daiwei Zhu; Miguel Angel Lopez-Ruiz; François-Henry Rouet; Claudio Girotto; Willie Aboumrad; Robert Lucas; Ananth Kaushik; Martin Roetteler

Abstract

Solving large-scale sparse linear systems is computationally challenging because factorization introduces new non-zero elements, known as "fill-in." The Graph Partitioning Problem (GPP) arises naturally when minimizing fill-in and accelerating solvers. In this paper, we measure the end-to-end performance of a hybrid quantum-classical framework designed to accelerate Finite Element Analysis (FEA) by integrating a quantum solver for GPP into Synopsys/Ansys' LS-DYNA multiphysics simulation software. The quantum solver is based on Iterative-QAOA, a scalable, non-variational quantum approach to optimization. We focus on two classes of FEA problems: vibrational (eigenmode) analysis and transient simulation. We report numerical simulations on up to 150 qubits performed with NVIDIA's CUDA-Q/cuTensorNet, as well as an implementation on IonQ's Forte quantum hardware. The potential impact on LS-DYNA workflows is quantified by measuring the wall-clock time-to-solution for complex problem instances, including vibrational analysis of large finite element models of a sedan car and a Rolls-Royce jet engine, as well as transient simulations of a drill and an impeller. We performed end-to-end performance measurements on meshes comprising up to 35 million elements. Measurements were conducted using LS-DYNA in distributed-memory mode via the Message Passing Interface (MPI) on AWS and Synopsys compute clusters. Our findings indicate that, with a quantum computer in the loop, amortized LS-DYNA wall-clock time can be improved by up to 15% for specific cases and by at least 7% for all models considered. These results highlight the significant potential of quantum computing to reduce time-to-solution for large-scale FEA simulations within the Noisy Intermediate-Scale Quantum (NISQ) era, offering an approach that is scalable and extendable into the fault-tolerant quantum computing regime.
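To make the notion of fill-in concrete, the following is a minimal sketch using the textbook "elimination game" model of sparse factorization: eliminating a vertex connects all of its remaining neighbours, and every edge created this way is one fill-in entry. It is an illustration only, not the ordering procedure used by LS-DYNA or by the paper; the function names `count_fill_in` and `star_graph` are hypothetical and chosen for this example.

```python
from itertools import combinations


def count_fill_in(adjacency, order):
    """Count fill-in edges created by eliminating vertices in the given order."""
    # Work on a copy so the caller's graph is left untouched.
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        # Eliminating v couples all of its remaining neighbours pairwise.
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill += 1
    return fill


def star_graph(n_leaves):
    """Hub vertex 0 connected to leaves 1..n_leaves (an 'arrow' matrix pattern)."""
    adj = {0: set(range(1, n_leaves + 1))}
    for leaf in range(1, n_leaves + 1):
        adj[leaf] = {0}
    return adj


star = star_graph(4)
hub_first = count_fill_in(star, [0, 1, 2, 3, 4])  # eliminate the hub first
hub_last = count_fill_in(star, [1, 2, 3, 4, 0])   # eliminate the hub last
print(hub_first, hub_last)  # 6 0
```

The same matrix pattern produces six fill-in entries or none depending only on the elimination order. Orderings derived from graph partitions exploit this effect by placing separator vertices late in the elimination.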

Paper Structure (16 sections, 9 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Graph Partitioning Problem Formulation
The Iterative-QAOA algorithm
Methods
Overview of the end-to-end workflow
Generation of coarsened graphs
Application of Iterative-QAOA
Classical heuristic refinement
Defining the figures of merit
IonQ quantum hardware
Results
Results from simulations and hardware execution
Evaluation of figures of merit
Conclusions and Outlook
Supplementary Information
...and 1 more section

Figures (11)

Figure 1: End-to-end quantum-accelerated linear algebra workflow. Starting from the mesh/adjacency graph of the sparse system, the original graph is coarsened to a hardware-matched size, the coarse graph partitioning problem (GPP) is solved on the QPU (Iterative-QAOA), and the resulting coarse graph bi-partition is lifted to a full-resolution graph partition. The lifted partition defines the separator/ordering used downstream for matrix reordering and sparse factorization.
Figure 2: LR-QAOA performance landscape for a 24-qubit drill instance. The dashed horizontal line marks the optimal $\Delta = 1.0$ that minimizes $\langle H_C \rangle$.
Figure 3: Iterative-QAOA executed on a 120-qubit drill problem instance using the NVIDIA CUDA-Q/cuTensorNet MPS simulator with a bond dimension $\chi = 256$. The panels show the cost probability distributions at the initial (Iter = 0), an intermediate (Iter = 3), and the final (Iter = 9) iteration. The number of layers in the QAOA circuit was $p = 5$. The algorithm parameters used are $\Delta = 0.3$.
Figure 4: Total wall-clock time (WCT) reduction as a function of coarse graph size using quantum-derived partitions. The horizontal dashed line denotes the normalized baseline ($1.0$), established by the optimal total WCT achieved via the internal LS-DYNA partitioner on the original $10{,}000$-node baseline graph. All relative performance data are plotted with respect to this baseline. For each graph size, box plots represent the distribution of the total WCT outcomes from the 20 highest-quality partitions (whiskers show the 90th and 10th percentiles, box boundaries show the 75th and 25th percentiles, and the line shows the median), with diamond markers highlighting the absolute minimum total WCT attained. Although inherent stochasticity is observed due to the factors discussed in the section "Evaluation of figures of merit," the cumulative data indicate a discernible trend toward improved total WCT as the coarsened graph size increases.
Figure 5: Maximum percentage reduction in total wall-clock time (WCT) across the different models and coarse graph sizes. Results illustrate performance gains achieved using partitions generated by the quantum algorithm executed on the IonQ Forte QPU for 36-node graphs, and on NVIDIA's CUDA-Q/cuTensorNet Matrix Product States (MPS) simulator running on NVIDIA A100/H100 GPUs for larger graphs. Improvement is expressed as a percentage relative to the baseline, which is defined as the minimum total WCT attained using the optimal internal LS-DYNA partition of the original $10{,}000$-node graph.
...and 6 more figures
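The lifting step described in the Figure 1 caption, where a coarse bi-partition is carried back to the full-resolution graph, can be sketched as follows. This is a minimal illustration under stated assumptions: the coarsening map `fine_to_coarse`, the toy path graph, and the coarse solution `coarse_side` are all hypothetical stand-ins (the coarse solution here is hard-coded rather than produced by Iterative-QAOA), and no refinement of the lifted partition is performed.

```python
def lift_partition(fine_to_coarse, coarse_side):
    """Give every fine vertex the side (0 or 1) of the coarse vertex it was merged into."""
    return {v: coarse_side[c] for v, c in fine_to_coarse.items()}


def cut_size(edges, side):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


# Toy fine graph: a path 0-1-2-3-4-5.
fine_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Hypothetical coarsening: neighbouring fine vertices merged in pairs.
fine_to_coarse = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}

# Stand-in for the coarse bi-partition returned by the solver (assumed values).
coarse_side = {"a": 0, "b": 0, "c": 1}

fine_side = lift_partition(fine_to_coarse, coarse_side)
print(fine_side)
print(cut_size(fine_edges, fine_side))  # 1
```

In this toy case only the edge (3, 4) crosses the lifted partition. Because all fine vertices merged into one coarse vertex share a side, edges internal to a coarse vertex can never be cut by the lifted solution.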
