Model-based Offline Quantum Reinforcement Learning

Simon Eisenmann, Daniel Hein, Steffen Udluft, Thomas A. Runkler

TL;DR

This work tackles offline reinforcement learning in the quantum domain with a model-based approach: a variational quantum circuit (VQC) is trained on pre-recorded data to act as a surrogate dynamics model. A second VQC serves as the policy, and its parameters $\omega$ are optimized by gradient-free particle swarm optimization (PSO) to maximize the finite-horizon return estimated through model rollouts, $\mathcal{R}^{\pi_\omega}(s_0)=\sum_{t=1}^{H} r(\tilde{s}_t)$, where $\tilde{s}_{t+1}=m(s_t,a_t)$ and $a_t=\pi_\omega(s_t)$; here $m$ is the surrogate model, $r$ the reward function, and $H$ the rollout horizon. The method is demonstrated on the cart-pole task, where seven out of ten policy searches yield perfect policies, showing that the surrogate model can enable effective policy search despite being trained only on offline data. Further analyses examine data re-uploading and data efficiency: re-uploading substantially improves prediction accuracy, and VQCs currently lag classical neural networks (NNs) in data efficiency but remain capable of guiding successful policy optimization. The study highlights the potential for quantum advantages as quantum hardware scales, while acknowledging current hardware limitations and proposing future fully quantum policy-search pathways, including Grover-based optimization ideas.
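The rollout-based return estimate can be illustrated with a minimal sketch. It assumes plain Python callables in place of the two VQCs; the names `rollout_return`, `toy_model`, `toy_policy`, and `toy_reward` are hypothetical, and the toy linear dynamics and quadratic reward are illustrative stand-ins rather than the cart-pole setup used in the paper.

```python
import numpy as np

def rollout_return(model, policy, reward, s0, horizon):
    """Sum the rewards of the H model-predicted states following s0."""
    state = np.asarray(s0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        action = policy(state)        # a_t = pi_omega(s_t)
        state = model(state, action)  # next state predicted by the surrogate m
        total += reward(state)        # accumulate r of the predicted state
    return total

# Hypothetical stand-ins (assumptions, not the paper's VQCs or reward):
def toy_model(state, action):
    return 0.9 * state + action       # linear toy dynamics

def toy_policy(state):
    return -0.5 * state               # proportional feedback

def toy_reward(state):
    return float(-np.sum(state ** 2)) # penalize distance from the origin

estimate = rollout_return(toy_model, toy_policy, toy_reward, [1.0], horizon=3)
```

No call to the real environment appears in the loop, which is what makes the estimate usable in the offline setting: only the learned model is queried.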

Abstract

This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.
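As a rough illustration of the gradient-free policy search, the sketch below implements a basic global-best particle swarm optimizer in NumPy. The swarm topology, the hyperparameters, and the names `pso_maximize` and `toy_fitness` are assumptions chosen for brevity and are not taken from the paper; in the paper's setting the fitness would be the return estimate given by the surrogate model, whereas here a simple quadratic stands in for it.

```python
import numpy as np

def pso_maximize(fitness, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c_cog=1.5, c_soc=1.5, bound=1.0, seed=0):
    """Global-best PSO that maximizes `fitness` over a `dim`-dimensional vector."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-bound, bound, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                                # per-particle best positions
    best_fit = np.array([fitness(p) for p in pos])       # per-particle best fitness
    g = int(np.argmax(best_fit))
    g_pos, g_fit = best_pos[g].copy(), best_fit[g]       # swarm-wide best
    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity update: inertia + pull to own best + pull to swarm best.
        vel = (inertia * vel
               + c_cog * r1 * (best_pos - pos)
               + c_soc * r2 * (g_pos - pos))
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit > best_fit
        best_pos[improved] = pos[improved]
        best_fit[improved] = fit[improved]
        g = int(np.argmax(best_fit))
        if best_fit[g] > g_fit:
            g_pos, g_fit = best_pos[g].copy(), best_fit[g]
    return g_pos, g_fit

# Hypothetical fitness: a quadratic with its maximum at w = (0.3, 0.3).
def toy_fitness(w):
    return -float(np.sum((w - 0.3) ** 2))

best_w, best_f = pso_maximize(toy_fitness, dim=2)
```

Because the optimizer only needs fitness values, never gradients, the policy circuit does not have to be differentiated during the search.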

Paper Structure (16 sections, 4 equations, 6 figures, 1 algorithm)

Figures (6)

  • Figure 1: VQC for the cart-pole surrogate model. The design of the sketch is based on the works of Chen2020 and bergholm2022pennylane.
  • Figure 2: Visualizing direct optimization of policy parameters using PSO in a model-based RL context.
  • Figure 3: VQC for the policy. The design of the sketch is based on the works of Chen2020 and bergholm2022pennylane.
  • Figure 4: Learning curves of ten VQC policy search experiments. Each color denotes the same experiment across all panels. (a) Average training return of the VQC policy search on the VQC surrogate model. (b) Average evaluation return of the same policies evaluated in the cart-pole simulation. (c) Average number of environment steps achieved by the same policies in the cart-pole simulation. Note that seven out of ten policies are considered perfect policies, i.e., they balanced the pole from all 100 randomly drawn evaluation states for the full episode of 500 steps.
  • Figure 5: VQC surrogate model experiments. The validation loss is presented. Each point depicts the average over 100 trainings, with $1\sigma$ error bars. (a) Impact of data re-uploading on the prediction accuracy of cart-pole states using a VQC. (b) Comparative analysis of data efficiency between VQCs and classical neural networks in predicting cart-pole states.
  • ...and 1 more figures