Cache Hierarchy and Vectorization Analysis of Lindblad Master Equation Simulation for Near-Term Quantum Control

Rylan Malarchick

Abstract

Simulation of open quantum systems via the Lindblad master equation is a computational bottleneck in near-term quantum control workflows, including optimal pulse engineering (GRAPE), trajectory-based robustness analysis, and feedback controller design. For the system sizes relevant to near-term quantum control ($d = 3$ for a single transmon with leakage, $d = 9$ for two qubits, and $d = 27$ for three qubits), the dominant cost per timestep is a $(d^2 \times d^2)$ complex matrix-vector multiplication: a $9\times9$, $81\times81$, or $729\times729$ dense matvec, respectively. The corresponding working set sizes (1.5 KB, 105 KB, and 8.1 MB) straddle the L1, L2, and L3 cache boundaries of modern CPUs, making this an ideal system for cache-hierarchy performance analysis. We characterize the arithmetic intensity ($\approx 1/2$ FLOP/byte in the large-$d$ limit), construct a Roofline model for the propagation kernel, and systematically vary compiler flags and data layout to isolate the contributions of auto-vectorization, fused multiply-add, and structure-of-arrays (SoA) memory layout. We show that SoA layout combined with -O3 -march=native -ffast-math yields $2$--$4\times$ speedup over scalar array-of-structures baselines, and that -ffast-math is essential for enabling GCC auto-vectorization of complex arithmetic. These results motivate a set of concrete recommendations for authors of quantum simulation libraries targeting near-term system sizes.
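
The working set sizes and the arithmetic intensity quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is an illustration, not code from the paper. It assumes double-precision complex entries (16 bytes each), a working set consisting of the $n \times n$ matrix plus one input and one output vector with $n = d^2$, and 8 FLOPs per complex multiply-accumulate. The function names are hypothetical. Under these assumptions the byte counts come out at about 1.5 KiB, 105 KiB, and 8.1 MiB, and the intensity is $n/(2n+4)$, which tends to $1/2$ as $d$ grows.

```c
#include <stddef.h>

/* Bytes touched by one dense complex matvec y = L x with n = d*d.
 * Assumed accounting: an n-by-n matrix plus the input and output
 * vectors, 16 bytes per double-precision complex entry. */
size_t working_set_bytes(size_t d)
{
    size_t n = d * d;
    return 16 * (n * n + 2 * n);
}

/* FLOP/byte for the same kernel, counting 8 FLOPs per complex
 * multiply-accumulate (4 multiplies and 4 additions). */
double arithmetic_intensity(size_t d)
{
    size_t n = d * d;
    double flops = 8.0 * (double)n * (double)n;
    return flops / (double)working_set_bytes(d);
}
```

Calling `working_set_bytes(9)` gives 107568 bytes, and `arithmetic_intensity(27)` is just under 0.5. The smallest case, $d = 3$, sits noticeably lower at $9/22 \approx 0.41$ because the vectors are not negligible next to a $9 \times 9$ matrix.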

Paper Structure

This paper contains 16 sections, 8 equations, 2 figures, 4 tables.

Introduction
Background
The Lindblad propagation kernel
Cache hierarchy and the Roofline model
Compiler auto-vectorization of complex matvec
Methods
C library implementation
Compiler flag experiments
Benchmarking methodology
Results
Roofline characterization
Compiler vectorization analysis
Discussion
Recommendations for quantum simulation library authors
Conclusion
...and 1 more sections

Figures (2)

Figure 1: Achieved bandwidth (GB/s) for each layout variant at -O3 -march=native -ffast-math, grouped by system dimension. The $x$-axis labels indicate the cache level that holds the working set. SoA outperforms both AoS and hand-written AVX2 at every size. Bandwidth decreases from L1 to L3, consistent with the cache hierarchy.
Figure 2: Achieved bandwidth (GB/s) for the SoA variant across compiler flag configurations and system dimensions. The -ffast-math flag produces the dominant speedup. Note the anomalous regression at $d = 9$ when adding -march=native without -ffast-math (see text).
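
To make the SoA variant in the figure captions concrete, here is a minimal sketch of a complex matvec with separate real and imaginary arrays. It is not the paper's implementation: the function name, signature, and row-major storage are assumptions made for illustration. Splitting the parts turns the inner loop into plain real multiply-adds over contiguous arrays, which is the form auto-vectorizers tend to handle well. One plausible reason -ffast-math matters is that the inner loop is a floating-point reduction, and vectorizing it reorders the additions, which GCC will not do under strict IEEE semantics.

```c
#include <stddef.h>

/* y = A x for an n-by-n complex matrix stored row-major in
 * structure-of-arrays form: real and imaginary parts live in
 * separate contiguous arrays. Name and signature are hypothetical. */
void matvec_soa(size_t n,
                const double *restrict a_re, const double *restrict a_im,
                const double *restrict x_re, const double *restrict x_im,
                double *restrict y_re, double *restrict y_im)
{
    for (size_t i = 0; i < n; ++i) {
        const double *row_re = a_re + i * n;
        const double *row_im = a_im + i * n;
        double acc_re = 0.0;
        double acc_im = 0.0;
        /* (a + ib)(c + id) = (ac - bd) + i(ad + bc), accumulated over j. */
        for (size_t j = 0; j < n; ++j) {
            acc_re += row_re[j] * x_re[j] - row_im[j] * x_im[j];
            acc_im += row_re[j] * x_im[j] + row_im[j] * x_re[j];
        }
        y_re[i] = acc_re;
        y_im[i] = acc_im;
    }
}
```

For the propagation kernel described in the abstract, `n` would be $d^2$ (9, 81, or 729). The `restrict` qualifiers tell the compiler the arrays do not alias, which removes one common obstacle to vectorization. Whether a given compiler actually vectorizes this loop should be confirmed from its optimization report rather than assumed.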
