Hybrid quantum programming with PennyLane Lightning on HPC platforms
Ali Asadi, Amintor Dusko, Chae-Yeun Park, Vincent Michaud-Rioux, Isidor Schoch, Shuli Shu, Trevor Vincent, Lee James O'Riordan
TL;DR
This paper presents PennyLane Lightning, a suite of high-performance state-vector quantum simulators designed for CPU, GPU, and HPC environments. It describes specialized gate kernels, SIMD-accelerated implementations, and three backends (Lightning-Qubit, Lightning-GPU, Lightning-Kokkos) that enable scalable, differentiable quantum workloads such as QAOA and VQE, with examples of up to 30 qubits on a single device or node and up to 41 qubits across multiple nodes. Microbenchmark and end-to-end results show gains from AVX-512/SIMD kernels, GPU advantages for larger problems, and MPI-distributed execution for circuit cutting, batched VQE, and large-scale sampling and gradient tasks. By combining task-based, batched, and distributed execution strategies, Lightning supports a wide range of HPC workloads, enables efficient classical-quantum co-processing, and offers practical guidance on the resources that large-scale quantum workloads require. The work highlights the practical impact of HPC-aware quantum software: performance-portable backends and native gradient support that accelerate quantum algorithm development and validation on contemporary supercomputing platforms.
Abstract
We introduce PennyLane's Lightning suite, a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads. Quantum applications such as QAOA, VQE, and synthetic workloads are implemented to demonstrate the supported classical computing architectures and to showcase the scale of problems that can be simulated using our tooling. We benchmark the performance of Lightning with backends supporting CPUs, as well as NVIDIA and AMD GPUs, and compare the results to other commonly used high-performance simulator packages, showing where Lightning's implementations lead in performance. We show improved CPU performance by employing explicit SIMD intrinsics and multi-threading, batched task-based execution across multiple GPUs, and distributed forward and gradient-based quantum circuit executions across multiple nodes. Our data show that we can comfortably simulate a variety of circuits, giving examples with up to 30 qubits on a single device or node, and up to 41 qubits using multiple nodes.
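To put the 30-qubit and 41-qubit figures in context, the following minimal sketch estimates the memory needed to hold a full state vector. It assumes double-precision complex amplitudes (16 bytes each), which is an assumption and not a figure from the text. The helper names `state_vector_bytes` and `format_bytes` are hypothetical and not part of PennyLane. The estimate covers only the state vector itself and ignores workspace, communication buffers, and any extra copies needed for gradient evaluation.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed to store all 2**num_qubits amplitudes of a state vector.

    bytes_per_amplitude=16 assumes complex128 (an assumption, not from the paper);
    pass 8 to model single-precision complex amplitudes instead.
    """
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (1 << num_qubits) * bytes_per_amplitude


def format_bytes(num_bytes: int) -> str:
    """Render a byte count using binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


if __name__ == "__main__":
    for n in (30, 41):
        print(f"{n} qubits: {format_bytes(state_vector_bytes(n))}")
```

Under these assumptions, a 30-qubit state vector occupies 16 GiB and a 41-qubit state vector occupies 32 TiB, which is why the larger case has to be distributed across multiple nodes.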
