Warm-Start Variational Quantum Policy Iteration

Nico Meyer; Jakob Murauer; Alexander Popov; Christian Ufrecht; Axel Plinge; Christopher Mutschler; Daniel D. Scherer

TL;DR

This paper tackles speeding up reinforcement-learning policy iteration by embedding a variational, quantum-enhanced linear-system solver into the policy evaluation step. The proposed VarQPI framework, augmented with warm-start initialization (WS-VarQPI), leverages a variational LSE solver and $\ell_{\infty}$-tomography for efficient quantum-assisted policy evaluation and classical policy improvement. Empirical evidence on FrozenLake environments shows robust performance, with WS-VarQPI achieving notable reductions in training steps and enabling up-scaling to larger problems (e.g., $256\times256$ linear systems) while maintaining ground-truth accuracy. The work analyzes the sparsity and conditioning of typical RL-induced systems, argues for practical quantum advantage under realistic constraints, and highlights hardware-related bottlenecks and directions for future validation on quantum devices.

Abstract

Reinforcement learning is a powerful framework for determining optimal behavior in highly complex decision-making scenarios. This objective can be achieved using policy iteration, which requires solving a typically large linear system of equations. We propose the variational quantum policy iteration (VarQPI) algorithm, which realizes this step with a NISQ-compatible quantum-enhanced subroutine. Its scalability is supported by an analysis of the structure of generic reinforcement learning environments, laying the foundation for potential quantum advantage with utility-scale quantum computers. Furthermore, we introduce the warm-start initialization variant (WS-VarQPI), which significantly reduces resource overhead. The algorithm solves a large FrozenLake environment with an underlying $256\times256$-dimensional linear system, indicating its practical robustness.
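As a purely classical point of reference, the sketch below runs policy iteration on a hypothetical two-state, two-action MDP, with policy evaluation posed as a linear system. It assumes the standard Bellman form $(I - \gamma P_{\pi}) Q_{\pi} = R$ for the system and a discount factor $\gamma$, neither of which is spelled out in this summary. A direct Gaussian-elimination solver stands in where VarQPI would use the variational quantum solver followed by $\ell_{\infty}$-tomography. All names (`solve`, `evaluate`, `policy_iteration`, `TOY_P`, `TOY_R`) are illustrative and not taken from the paper.

```python
from itertools import product


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def evaluate(P, R, policy, gamma):
    """Policy evaluation: build and solve (I - gamma * P_pi) Q = R.

    P[s][a][s2] is the transition probability, R[s][a] the reward
    (assumed layout for this sketch).
    """
    n_s, n_a = len(P), len(P[0])

    def idx(s, a):
        return s * n_a + a

    size = n_s * n_a
    A = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for s, a in product(range(n_s), range(n_a)):
        i = idx(s, a)
        A[i][i] += 1.0
        b[i] = R[s][a]
        for s2 in range(n_s):
            # The next action is fixed by the current policy.
            A[i][idx(s2, policy[s2])] -= gamma * P[s][a][s2]
    q = solve(A, b)  # classical stand-in for the quantum subroutine
    return [[q[idx(s, a)] for a in range(n_a)] for s in range(n_s)]


def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)
    while True:
        Q = evaluate(P, R, policy, gamma)
        improved = [max(range(len(row)), key=row.__getitem__) for row in Q]
        if improved == policy:
            return policy, Q
        policy = improved


# Hypothetical toy MDP: action 0 stays, action 1 switches state;
# only staying in state 1 is rewarded.
TOY_P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
TOY_R = [[0.0, 0.0], [1.0, 0.0]]

toy_policy, toy_Q = policy_iteration(TOY_P, TOY_R, gamma=0.9)
```

For this toy problem the loop settles on switching from state 0 to state 1 and then staying there. The linear system has one row per state-action pair, so its dimension grows with the product of state and action counts, which is the step the quantum subroutine targets.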

Paper Structure (15 sections, 4 theorems, 32 equations, 6 figures, 1 table)

This paper contains 15 sections, 4 theorems, 32 equations, 6 figures, 1 table.

Introduction
Preliminaries and Method Description
Direct Policy Iteration
Variational LSE Solvers
Variational Quantum Policy Iteration with Warm Start
Experimental End-to-End Realization
Warm-Start Parameter Initialization
Up-Scaling to Larger Environment
Resource Analysis of Quantum Subroutine
Loss Threshold and Environment Stochasticity
Depth of Variational Ansatz
Conclusion
Sparsity for Local Dynamics
Bounds on Condition Number
Notes on Unitary Decomposition

Key Result

Proposition 1

For a square matrix $A \in \mathbb{C}^{N \times N}$ it holds assuming the denominator on the right side is positive.

Figures (6)

Figure 1: Proposed routine of variational quantum policy iteration. The evaluation of the state-action value function $Q_{\pi}$ is formulated as a linear system $A \bm{x} = \bm{b}$. The solution is computed variationally (Bravo_2023), where the blue sub-circuit prepares a state proportional to $Q_{\pi}$. Classically efficient $\ell_{\infty}$-tomography is used for policy improvement. This procedure is iterated until an optimal policy $\pi_*$ is found. After random initialization in the first iteration, the warm-start variant carries the variational parameters $\bm{\alpha}$ over from the previous iteration, leading to faster policy evaluation.
Figure 2: Loss curves for one instance of WS-VarQPI (left plot) and randomly initialized VarQPI (right plot) on FrozenLake with $\beta=0.1$. The vertical dotted lines each mark the end of one iteration of policy evaluation, run to a loss threshold of $0.0001$. After policy improvement, the procedure continues with the updated linear system, starting either from the previously optimal parameters (left) or from the initial parameters of the first iteration (right). The large variance in the training procedure can be attributed to the use of a static learning rate of $0.01$. While learning rate schedulers might reduce the fluctuations, the additional hyperparameter tuning they require was found to be non-trivial for the proof-of-concept realization.
Figure 3: The policy found for the FrozenLake8x8 environment with stochasticity $\beta=0.1$ using WS-VarQPI. Training required $82160$ steps in $9$ policy evaluation and improvement cycles. The experiments use a depth $d=24$ ansatz and a loss threshold of $0.0001$. The solution aligns perfectly with the ground truth for this configuration.
Figure 4: Success rates of variational quantum policy evaluation with a depth-$12$ circuit for different loss thresholds, together with the training steps these require. The results are averaged over $1000$ random instances of FrozenLake with varying stochasticity $\beta$; the bands in the lower plot denote standard deviations. Training is terminated once the loss value $C_{G}$ has dropped below the given threshold. The success rate denotes the percentage of runs for which the greedy action selection agrees with the ground truth. To account for sampling noise, we tolerate a deviation of $0.001$ in the associated $Q$-values.
Figure 5: Success rates of variational quantum policy evaluation for different circuit depths, together with the training steps these require. The results are averaged over $1000$ random instances of FrozenLake with varying stochasticity $\beta$; the bands in the lower plot denote standard deviations. The success rate denotes the percentage of runs for which the greedy action selection agrees with the ground truth. To account for sampling noise, we tolerate a deviation of $0.001$ in the associated $Q$-values. The percentages in the lower plot denote the ratio of runs, where below $100\%$, that reached the loss threshold $0.0001$ in fewer than $10000$ steps, after which training is aborted.
...and 1 more figure
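The warm-start idea from Figure 1 can be illustrated with a classical analogue. The sketch below is not the variational circuit training used in the paper: it assumes a plain fixed-point iteration $\bm{x} \leftarrow \bm{b} + \gamma P_{\pi} \bm{x}$, where the iterate $\bm{x}$ plays the role of the variational parameters $\bm{\alpha}$. The two-state transition matrices, the reward vector, and the names `fixed_point_solve`, `P_OLD`, and `P_NEW` are hypothetical. It compares how many update steps the system after a policy change needs from a cold start versus from the previous solution.

```python
def fixed_point_solve(P_pi, b, gamma, x0, tol=1e-4, max_steps=10_000):
    """Iterate x <- b + gamma * P_pi x until the update is smaller than tol.

    Returns the final iterate and the number of update steps applied.
    """
    n = len(b)
    x = list(x0)
    for step in range(max_steps):
        new = [
            b[i] + gamma * sum(P_pi[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(new[i] - x[i]) for i in range(n)) < tol:
            return x, step
        x = new
    return x, max_steps


GAMMA = 0.9
REWARD = [0.0, 1.0]                 # only state 1 is rewarded
P_OLD = [[1.0, 0.0], [0.0, 1.0]]    # old policy: both states stay put
P_NEW = [[0.0, 1.0], [0.0, 1.0]]    # improved policy: state 0 moves to state 1

# Solve the system of the old policy from scratch.
x_old, _ = fixed_point_solve(P_OLD, REWARD, GAMMA, x0=[0.0, 0.0])

# Solve the system of the improved policy, cold versus warm.
_, cold_steps = fixed_point_solve(P_NEW, REWARD, GAMMA, x0=[0.0, 0.0])
x_warm, warm_steps = fixed_point_solve(P_NEW, REWARD, GAMMA, x0=x_old)

print(f"cold start: {cold_steps} steps, warm start: {warm_steps} steps")
```

In this toy case the warm start needs fewer steps because a policy change in one state leaves most of the previous solution valid. That is the intuition only; a warm start is not guaranteed to help for every pair of consecutive systems.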

Theorems & Definitions (13)

Definition 1: local dynamics
Definition 2: deterministic dynamics
Definition 3: uniform local dynamics
Definition 4: exponential local dynamics
Proposition 1
proof
proof
Proposition 2
proof
Theorem 1
...and 3 more
