Warm-Start Variational Quantum Policy Iteration

Nico Meyer, Jakob Murauer, Alexander Popov, Christian Ufrecht, Axel Plinge, Christopher Mutschler, Daniel D. Scherer

TL;DR

This paper tackles speeding up reinforcement-learning policy iteration by embedding a variational, quantum-enhanced linear-system solver into the policy evaluation step. The proposed VarQPI framework, augmented with warm-start initialization (WS-VarQPI), leverages a variational LSE solver and $\ell_{\infty}$-tomography for efficient quantum-assisted policy evaluation and classical policy improvement. Empirical evidence on FrozenLake environments shows robust performance, with WS-VarQPI achieving notable reductions in training steps and enabling up-scaling to larger problems (e.g., $256\times256$ linear systems) while maintaining ground-truth accuracy. The work analyzes the sparsity and conditioning of typical RL-induced systems, argues for practical quantum advantage under realistic constraints, and highlights hardware-related bottlenecks and directions for future validation on quantum devices.

Abstract

Reinforcement learning is a powerful framework aiming to determine optimal behavior in highly complex decision-making scenarios. This objective can be achieved using policy iteration, which requires solving a typically large linear system of equations. We propose the variational quantum policy iteration (VarQPI) algorithm, realizing this step with a NISQ-compatible quantum-enhanced subroutine. Its scalability is supported by an analysis of the structure of generic reinforcement learning environments, laying the foundation for potential quantum advantage with utility-scale quantum computers. Furthermore, we introduce the warm-start initialization variant (WS-VarQPI) that significantly reduces resource overhead. The algorithm solves a large FrozenLake environment with an underlying $256\times256$-dimensional linear system, indicating its practical robustness.
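To make the linear system behind policy evaluation concrete, here is a minimal classical sketch of policy iteration for a deterministic policy. It uses the standard formulation $(I - \gamma P_{\pi})\,\bm{q} = \bm{r}$ over state-action pairs, which may differ in normalization from the system used in the paper. The direct solve via `np.linalg.solve` is only a stand-in for the variational quantum solver, so warm-starting has no counterpart here. The function names and the two-state toy MDP are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def evaluate_policy(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) q = r for a deterministic policy.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    policy: (S,) integer action per state. Returns Q with shape (S, A).
    """
    S, A = R.shape
    P_pi = np.zeros((S * A, S * A))
    # The pair (s', policy[s']) lives at flat index s' * A + policy[s'].
    cols = np.arange(S) * A + policy
    P_pi[:, cols] = P.reshape(S * A, S)
    system = np.eye(S * A) - gamma * P_pi
    q = np.linalg.solve(system, R.reshape(S * A))  # classical stand-in
    return q.reshape(S, A)


def policy_iteration(P, R, gamma, max_iters=100):
    """Alternate policy evaluation and greedy improvement until stable."""
    S, _ = R.shape
    policy = np.zeros(S, dtype=int)
    Q = evaluate_policy(P, R, policy, gamma)
    for _ in range(max_iters):
        new_policy = Q.argmax(axis=1)  # greedy policy improvement
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
        Q = evaluate_policy(P, R, policy, gamma)
    return policy, Q


# Hypothetical toy MDP: action 0 stays, action 1 switches state;
# staying in state 1 yields reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
policy, Q = policy_iteration(P, R, gamma=0.9)
```

For an environment with $|S|$ states and $|A|$ actions the system has dimension $|S||A|$, so a FrozenLake grid with 64 states and 4 actions gives a $256\times256$ system.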

Paper Structure

This paper contains 15 sections, 4 theorems, 32 equations, 6 figures, 1 table.

Key Result

Proposition 1

For a square matrix $A \in \mathbb{C}^{N \times N}$ it holds assuming the denominator on the right side is positive.

Figures (6)

  • Figure 1: Proposed routine of variational quantum policy iteration. The evaluation of the state-action value function $Q_{\pi}$ is formulated as a linear system of equations $A \bm{x} = \bm{b}$. The solution is computed variationally (Bravo_2023), where the blue sub-circuit prepares a state proportional to $Q_{\pi}$. Classically efficient $\ell_{\infty}$-tomography is used for policy improvement. This procedure is iterated until an optimal policy $\pi_*$ is found. After random initialization, warm-start variational parameters $\bm{\alpha}$ are carried over from the previous iteration, leading to faster policy evaluation.
  • Figure 2: Loss curves for one instance of WS-VarQPI (left plot) and randomly initialized VarQPI (right plot) on FrozenLake with $\beta=0.1$. The vertical dotted lines indicate one iteration of policy evaluation with loss threshold $0.0001$. The procedure continues with the updated linear system after policy improvement, starting either from the previously optimal parameters or from the initial parameters of the first iteration. The large variance in the training procedure can be attributed to the use of a static learning rate of $0.01$. While learning rate schedulers might be used to reduce the fluctuations, the required additional hyperparameter tuning was found to be non-trivial for the proof-of-concept realization.
  • Figure 3: The policy found for the FrozenLake8x8 environment with $\beta=0.1$-stochasticity using WS-VarQPI. Training required $82160$ steps in $9$ policy evaluation and improvement cycles. The experiments use a depth $d=24$ ansatz and a loss threshold of $0.0001$. The solution perfectly aligns with the ground truth for this configuration.
  • Figure 4: Success rates of variational quantum policy evaluation with a depth $12$ circuit for different loss thresholds and therefore required training steps. The results are averaged over $1000$ random instances of FrozenLake with varying stochasticity $\beta$; the bands in the lower plot denote standard deviations. Training is terminated once the loss value $C_{G}$ has declined below the given threshold. The success rate denotes the percentage of runs for which the greedy action selection (\ref{eq:greedy_action_selection}) agrees with the ground truth. To account for sampling noise, we tolerate a deviation of $0.001$ in the associated $Q$-values.
  • Figure 5: Success rates of variational quantum policy evaluation for different circuit depths and therefore required training steps. The results are averaged over $1000$ random instances of FrozenLake with varying stochasticity $\beta$; the bands in the lower plot denote standard deviations. The success rate denotes the percentage of runs for which the greedy action selection (\ref{eq:greedy_action_selection}) agrees with the ground truth. To account for sampling noise, we tolerate a deviation of $0.001$ in the associated $Q$-values. The percentages in the lower plot denote the ratio of runs below $100\%$ that achieved the loss threshold $0.0001$ with fewer than $10000$ steps, after which training is aborted.
  • ...and 1 more figures
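
The success criterion described in the captions of Figures 4 and 5 can be illustrated with a short sketch. It assumes one plausible reading of the tolerance: a run counts as successful if, in every state, the action that is greedy under the estimated $Q$-values has a ground-truth $Q$-value within $0.001$ of the best ground-truth value. The function name `greedy_agrees` and the example arrays are hypothetical, and the paper's exact check may differ.

```python
import numpy as np


def greedy_agrees(Q_est, Q_true, tol=1e-3):
    """Check whether greedy actions from Q_est are optimal under Q_true up to tol.

    Q_est, Q_true: arrays of shape (S, A).
    """
    greedy = Q_est.argmax(axis=1)
    # Ground-truth value of the action chosen from the estimate, per state.
    chosen_true = Q_true[np.arange(Q_true.shape[0]), greedy]
    gap = Q_true.max(axis=1) - chosen_true
    return bool(np.all(gap <= tol))


# Hypothetical values: in state 0 the two actions are nearly tied.
Q_true = np.array([[1.0, 1.0005], [0.2, 0.5]])
Q_close = np.array([[1.01, 1.0], [0.1, 0.6]])  # picks a near-optimal action
Q_wrong = np.array([[1.0, 0.9], [0.7, 0.6]])   # picks a clearly worse action

ok = greedy_agrees(Q_close, Q_true)
bad = greedy_agrees(Q_wrong, Q_true)
```

Comparing value gaps instead of raw action indices avoids counting near-ties as failures when sampling noise flips the order of two almost equal $Q$-values.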

Theorems & Definitions (13)

  • Definition 1: local dynamics
  • Definition 2: deterministic dynamics
  • Definition 3: uniform local dynamics
  • Definition 4: exponential local dynamics
  • Proposition 1
  • proof
  • proof
  • Proposition 2
  • proof
  • Theorem 1
  • ...and 3 more