Reinforcement Learning for Fast and Robust Longitudinal Qubit Readout

Yiming Yu; Yuan Qiu; Xinyu Zhao; Ye-Hong Chen; Yan Xia

Abstract

Longitudinal coupling offers a compelling pathway for quantum nondemolition (QND) readout, but pulse design is constrained by hardware limitations such as the coupling strength and the photon number required to stay within the linear regime. We develop a reinforcement learning framework to optimize the longitudinal coupling waveform under such constraints. Building upon the theoretical foundation of shortcuts to adiabaticity (STA), we parameterize an auxiliary trajectory with cubic B-splines and reconstruct the physical control from it. At a fixed short readout time, the optimized pulse converges to a constraint-saturating flat-top protocol and yields an approximately $50\%$ improvement in $\mathrm{SNR}$ over an STA baseline, while exhibiting enhanced robustness to parameter drifts. Simulation results demonstrate the efficacy of reinforcement learning in optimizing longitudinal readout pulses. The optimized protocol attains substantial performance gains and yields smooth, hardware-compatible waveforms governed by an interpretable ``saturate-and-hold'' mechanism.
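The cubic B-spline parameterization mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It assumes a clamped (open-uniform) knot vector on the normalized time $x = t/t_f$, which the source does not specify; the helper names `clamped_knots`, `bspline_basis`, and `auxiliary_trajectory` are hypothetical. The sketch covers only the auxiliary trajectory $g_c(t)=\sum_k c_k B_k(t/t_f)$, not the reconstruction of the physical coupling $g_z(t)$.

```python
def clamped_knots(n_coeffs, degree=3):
    """Clamped knot vector on [0, 1] (an assumption; the paper's knots are not given)."""
    n_inner = n_coeffs - degree - 1
    if n_inner < 0:
        raise ValueError("need at least degree + 1 coefficients")
    inner = [(i + 1) / (n_inner + 1) for i in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)


def bspline_basis(k, degree, knots, x):
    """Cox-de Boor recursion for the k-th B-spline basis function of a given degree."""
    if degree == 0:
        if knots[k] <= x < knots[k + 1]:
            return 1.0
        # Close the last non-empty interval so that x = 1 is covered.
        if x == knots[-1] and knots[k] < knots[k + 1] == knots[-1]:
            return 1.0
        return 0.0
    left = right = 0.0
    d1 = knots[k + degree] - knots[k]
    if d1 > 0:
        left = (x - knots[k]) / d1 * bspline_basis(k, degree - 1, knots, x)
    d2 = knots[k + degree + 1] - knots[k + 1]
    if d2 > 0:
        right = (knots[k + degree + 1] - x) / d2 * bspline_basis(k + 1, degree - 1, knots, x)
    return left + right


def auxiliary_trajectory(coeffs, t, t_f, degree=3):
    """Evaluate g_c(t) = sum_k c_k B_k(t / t_f) for spline coefficients c_k."""
    knots = clamped_knots(len(coeffs), degree)
    x = t / t_f
    return sum(c * bspline_basis(k, degree, knots, x) for k, c in enumerate(coeffs))


# Illustrative coefficients only (not values from the paper).
coeffs = [0.0, 0.2, 0.5, 0.5, 0.2, 0.0]
samples = [auxiliary_trajectory(coeffs, 0.1 * i, 1.0) for i in range(11)]
```

With clamped knots the trajectory starts at the first coefficient and ends at the last, so setting both to zero gives a pulse envelope that vanishes at $t=0$ and $t=t_f$. In a setup like the one described, an optimizer would adjust `coeffs` rather than the sampled waveform directly.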

Paper Structure (14 sections, 34 equations, 5 figures)

Introduction
Effective model and control parameterization
Inverse Engineering Control Model for Longitudinal Readout
Signal Generation and Hardware-Constrained Optimization Landscape
Reinforcement Learning Framework with Physics-Based Parameterization
Numerical Simulations and Results
Discussion and conclusions
Derivation of the Inverse Engineering Control Law
Equivalence of the semi-classical and quantum SNR definitions
Quantum homodyne observable and integrated measurement record
Signal separation (numerator)
Noise scaling and the effective noise spectral density $S_{\mathrm{eff}}$ (denominator)
Result: recovery of the semi-classical SNR expression
Sample-efficiency benchmark via iterations-to-target SNR

Figures (5)

Figure 1: Schematic of the physics-based reinforcement learning framework for optimizing longitudinal qubit readout. (a) Physical implementation of the longitudinal readout. A qubit (central sphere) is coupled to a resonator mode with frequency $\omega_r$ via a time-dependent longitudinal interaction $g_z(t)\sigma_z(\hat{a} + \hat{a}^\dagger)$. Depending on the qubit state $|e\rangle$ (red) or $|g\rangle$ (blue), the cavity field is displaced to symmetric conditional pointer states $+\alpha$ and $-\alpha$ in phase space, spanned by the expectation value of the annihilation operator $\langle \hat{a} \rangle$. (b) The closed-loop optimization control flow. The PPO agent outputs a set of coefficients $\mathbf{c}=\{c_k\}$ to construct the auxiliary control trajectory $g_c(t)$ using cubic B-spline basis functions $B_k(x)$. The physical control pulse $g_z(t)$ is then derived via inverse engineering constraints to ensure smooth boundary conditions. The system performance is evaluated to generate a reward $R$, which favors a high signal-to-noise ratio (SNR) at the readout time $t_f$ while penalizing physical constraint violations.
Figure 2: Physics-seeded PPO: training dynamics and readout performance. (a) Mean total reward $R(\mathbf{c})$ (black solid) and mean SNR reward $R_{\mathrm{SNR}}(\mathbf{c})$ (red solid) versus training iteration. The dashed and dotted lines show the mean penalty terms $P_{\mathrm{area}}$ and $P_{N}$, respectively. Shaded areas represent the standard deviation across independent training runs. (b) Evolution of the normalized control pulse $g_c(t)/\omega_r$ over the normalized time $t/t_f$ during training. Colors indicate the optimization iteration (from early purple to late yellow). (c) Auxiliary envelope $g_c(t)$ (solid lines) used to drive the effective mean-field dynamics, and the corresponding physical coupling $g_z(t)$ (markers) reconstructed via Eq. (\ref{eq:inverse_engineering}) in Sec. \ref{s2.1}. Blue denotes the STA seed and red denotes the PPO result. (d) Intracavity photon number $N(t)=|\alpha(t)|^2$ for the STA seed (blue dashed) and the PPO protocol (red solid). The horizontal reference line indicates the imposed photon-number limit $N_{\mathrm{max}}=50$. (e) Cumulative SNR versus the normalized readout time $t/t_f$ for the STA seed (blue line) and the PPO-optimized pulse (red line). (f) Histograms of the integrated homodyne record $I_m$ for the qubit states $|g\rangle$ and $|e\rangle$. Blue dashed lines and red solid lines (with shaded fill) correspond to the distributions obtained using the STA seed and the PPO protocol, respectively.
Figure 3: Scalability of the final-time SNR under dual hardware constraints. Final $\mathrm{SNR}(t_f)$ as a function of the maximum allowable longitudinal coupling $g_{z,\max}/\omega_r$. Different colors correspond to photon-number caps $N_{\max}\in\{30,40,50\}$. Solid lines: PPO-optimized protocols. Dashed lines: analytical STA baseline.
Figure 4: Worst-case robustness to timing and amplitude errors. The upper surface corresponds to the PPO-optimized pulse and the lower surface to the STA seed. The final SNR is evaluated under a bounded timing error $|\Delta t|/t_f$ (x-axis) and a bounded multiplicative amplitude error $|\Delta A|/A$ (y-axis). For each error bound, we report the worst-case SNR, defined as the minimum SNR over all error realizations within $\delta_t\in[-|\Delta t|,|\Delta t|]$ and $\delta_A\in[-|\Delta A|,|\Delta A|]$.
Figure S1: Mean PPO iterations required to reach a target SNR. Bars compare seeded and no-seed initialization across several target SNR thresholds. Seeded initialization consistently reduces the iterations-to-target, demonstrating improved sample efficiency.
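
The worst-case SNR defined in the Figure 4 caption can be approximated by a grid search over the error box. The sketch below is only illustrative: `worst_case_snr` is a hypothetical helper, the grid resolution is an arbitrary choice, and `toy_snr` is a placeholder response, not the paper's readout model. A grid search gives an upper bound on the true minimum over the continuous error set.

```python
def worst_case_snr(snr_fn, dt_bound, da_bound, n_grid=11):
    """Minimum of snr_fn(delta_t, delta_a) on a uniform grid over
    delta_t in [-dt_bound, dt_bound] and delta_a in [-da_bound, da_bound]."""
    worst = float("inf")
    for i in range(n_grid):
        delta_t = -dt_bound + 2.0 * dt_bound * i / (n_grid - 1)
        for j in range(n_grid):
            delta_a = -da_bound + 2.0 * da_bound * j / (n_grid - 1)
            worst = min(worst, snr_fn(delta_t, delta_a))
    return worst


def toy_snr(delta_t, delta_a):
    """Placeholder SNR response (assumption for illustration only):
    linear in the amplitude error, degraded by any relative timing error."""
    return 10.0 * (1.0 + delta_a) * (1.0 - abs(delta_t))


# Worst case for a 10% timing bound and a 5% amplitude bound on the toy model.
worst = worst_case_snr(toy_snr, dt_bound=0.10, da_bound=0.05)
```

Sweeping `dt_bound` and `da_bound` over a range of values and recording `worst` at each pair would produce a surface of the kind described in the caption, one per pulse.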
