Hybrid quantum programming with PennyLane Lightning on HPC platforms

Ali Asadi, Amintor Dusko, Chae-Yeun Park, Vincent Michaud-Rioux, Isidor Schoch, Shuli Shu, Trevor Vincent, Lee James O'Riordan

TL;DR

This paper presents PennyLane Lightning, a suite of high-performance state-vector quantum simulators designed for CPU, GPU, and HPC environments. It introduces specialized gate kernels, SIMD-accelerated implementations, and backend architectures (Lightning-Qubit, Lightning-GPU, Lightning-Kokkos) to enable scalable, differentiable quantum workloads, including QAOA and VQE, with demonstrated reach up to $30$ qubits on a single device and $41$ qubits across multiple nodes. The study provides microbenchmark and end-to-end results showing AVX-512/SIMD gains, GPU advantages for larger problems, and MPI-distributed capabilities for circuit cutting, batched VQE, and large-scale sampling/gradient tasks. By combining task-based, batched, and distributed execution strategies, Lightning supports a wide range of HPC workloads, enabling efficient classical-quantum co-processing and providing practical guidance on resource requirements for large-scale quantum workloads. The work highlights the practical impact of HPC-aware quantum software, offering performance-portable backends and native gradient support to accelerate quantum algorithm development and validation on contemporary supercomputing platforms.

Abstract

We introduce PennyLane's Lightning suite, a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads. Quantum applications such as QAOA, VQE, and synthetic workloads are implemented to demonstrate the supported classical computing architectures and showcase the scale of problems that can be simulated using our tooling. We benchmark the performance of Lightning with backends supporting CPUs, as well as NVIDIA and AMD GPUs, and compare the results to other commonly used high-performance simulator packages, demonstrating where Lightning's implementations give performance leads. We show improved CPU performance by employing explicit SIMD intrinsics and multi-threading, batched task-based execution across multiple GPUs, and distributed forward and gradient-based quantum circuit executions across multiple nodes. Our data shows we can comfortably simulate a variety of circuits, giving examples with up to 30 qubits on a single device or node, and up to 41 qubits using multiple nodes.
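
To give a feel for why 30 qubits fits on a single device while 41 qubits calls for multiple nodes, the sketch below estimates the memory needed to hold a dense state vector. It assumes double-precision complex amplitudes (16 bytes each), which the text above does not state, and it ignores any workspace or communication buffers a real simulator would need. The helper name `statevector_bytes` is hypothetical and not part of PennyLane.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Bytes needed for a dense state vector of 2**num_qubits amplitudes.

    Assumption: 16 bytes per amplitude (complex128); pass 8 for complex64.
    """
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (2 ** num_qubits) * bytes_per_amplitude


GIB = 2 ** 30  # bytes in one gibibyte

if __name__ == "__main__":
    for n in (30, 41):
        size = statevector_bytes(n)
        print(f"{n} qubits: {size / GIB:,.0f} GiB")
```

Under these assumptions a 30-qubit state occupies 16 GiB and a 41-qubit state occupies 32 TiB, since each extra qubit doubles the requirement.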

Paper Structure

This paper contains 30 sections, 1 equation, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: PennyLane Lightning template module architecture. Each Lightning backend device follows the same architectural design, with differences in the implemented gate set, observables, and compile-time targets. The common modules allow us to easily implement functionality across the package ecosystem and allow specialization per backend target.
  • Figure 2: AVX-512 kernel stages for applying an IsingXX gate across the intra-intra (coefficients all reside within a single register) qubit indices. Data is loaded into the registers, a permuted copy is created in a separate register, and the required trigonometric values are elementwise multiplied to produce the desired gate kernel outputs, which are then summed together. Upon completion, the register data overwrites the associated indices in main memory, and the process repeats on the next set of coefficients.
  • Figure 3: Performance results for the Lightning-Qubit RX gate kernels, comparing the default (LM), AVX2, AVX-512, and AVX-512+streaming kernel implementations across gate indices and OpenMP threads for a 30-qubit state vector. The kernels show improved performance from default to AVX2, and again to AVX-512, for the 1-, 4-, and 16-thread workloads, with the default LM kernels taking the lead elsewhere. Adding streaming operations to the AVX-512 kernels shows an advantage at higher thread counts, while at lower thread counts their performance falls between that of the LM and AVX2 kernels.
  • Figure 4: A comparison of the runtime-averaged gate performance for Hadamard (top), RX (middle), and CNOT (bottom) across a variety of high-performance quantum simulator frameworks for a 30-qubit state vector. The AVX-512 kernels for Lightning-Qubit show an advantage for a single thread, as well as in many of the multithreaded regimes, with the advantage tapering off at the highest thread count. For higher thread counts, enabling the streaming AVX-512 operations recovers the performance advantage.
  • Figure 5: The time to compute the total energy and its gradient 10 times for various molecules. The number of qubits and the number of terms in the molecular Hamiltonian for each molecule are shown in parentheses.
  • ...and 3 more figures
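
The stages described in the Figure 2 caption (load the coefficients, build a permuted copy, multiply elementwise by trigonometric values, sum) can be mirrored in plain scalar Python. This is only an illustrative sketch, not the AVX-512 kernel itself: it uses the standard definition IsingXX(φ) = cos(φ/2) I − i sin(φ/2) X⊗X, the function name `apply_isingxx` is hypothetical, and the convention that wire 0 is the least significant bit of the index is an assumption made here.

```python
import math


def apply_isingxx(state, phi, wire_a, wire_b):
    """Apply IsingXX(phi) to a dense state vector given as a list of complex numbers.

    Assumption: wires are bit positions in the amplitude index, with 0 the
    least significant bit. Returns a new list; the input is not modified.
    """
    n_amps = len(state)
    if n_amps == 0 or n_amps & (n_amps - 1):
        raise ValueError("state length must be a power of two")
    if wire_a == wire_b or (1 << max(wire_a, wire_b)) >= n_amps:
        raise ValueError("wires must be distinct and within range")

    # X (x) X flips both target bits, so each amplitude pairs with index i ^ mask.
    mask = (1 << wire_a) | (1 << wire_b)
    c = math.cos(phi / 2)
    s = math.sin(phi / 2)

    # "Permuted copy": the amplitudes reordered by the bit flips.
    permuted = [state[i ^ mask] for i in range(n_amps)]

    # Elementwise multiply by the trigonometric factors and sum the two terms.
    return [c * a - 1j * s * b for a, b in zip(state, permuted)]


if __name__ == "__main__":
    ket00 = [1 + 0j, 0j, 0j, 0j]
    print(apply_isingxx(ket00, math.pi / 2, 0, 1))
```

A vectorized kernel performs the same arithmetic on several amplitudes per register at once; the permutation step is what differs depending on whether the paired amplitudes sit in the same register.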