Model-based Offline Quantum Reinforcement Learning
Simon Eisenmann, Daniel Hein, Steffen Udluft, Thomas A. Runkler
TL;DR
This work tackles offline reinforcement learning in the quantum domain by introducing a model-based approach in which a variational quantum circuit (VQC) serves as a surrogate dynamics model learned from pre-recorded data. A second VQC serves as the policy; its parameters $\omega$ are optimized by gradient-free particle swarm optimization (PSO) to maximize the return over a horizon $H$ estimated through model rollouts, $\mathcal{R}^{\pi_\omega}(s_0)=\sum_{t=1}^{H} r(\tilde{s}_t)$, where $\tilde{s}_0=s_0$, $a_t=\pi_\omega(\tilde{s}_t)$, and $\tilde{s}_{t+1}=m(\tilde{s}_t,a_t)$. The method is demonstrated on the CartPole task, where it finds multiple perfect policies, showing that a surrogate model trained only on offline data can support effective policy search. Further analyses examine data re-uploading and data efficiency: re-uploading yields substantial gains, and VQCs currently lag classical neural networks in data efficiency but can still guide successful policy optimization. The study points to potential quantum advantages as quantum hardware scales, acknowledges current hardware limitations, and proposes future pathways toward fully quantum policy search, including Grover-based optimization ideas.
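The rollout-based return estimate can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model, policy, and reward are passed in as plain callables, and the names `rollout_return`, `toy_model`, `toy_policy`, and `toy_reward` are hypothetical stand-ins. In the paper both the model and the policy are VQCs; here they are replaced by trivial scalar functions so the recursion itself is easy to follow.

```python
from typing import Callable

# Hypothetical type aliases: a scalar state and action keep the sketch small.
Model = Callable[[float, float], float]   # (state, action) -> next state
Policy = Callable[[float], float]         # state -> action
Reward = Callable[[float], float]         # state -> reward


def rollout_return(model: Model, policy: Policy, reward: Reward,
                   s0: float, horizon: int) -> float:
    """Sum r(s_t) for t = 1..horizon along a trajectory simulated by the model."""
    state = s0
    total = 0.0
    for _ in range(horizon):
        action = policy(state)        # a_t = pi(s_t)
        state = model(state, action)  # s_{t+1} = m(s_t, a_t)
        total += reward(state)        # accumulate r(s_{t+1})
    return total


# Toy stand-ins (assumptions, not the CartPole setup from the paper).
def toy_model(state: float, action: float) -> float:
    return state + action


def toy_policy(state: float) -> float:
    return -0.5 * state


def toy_reward(state: float) -> float:
    return -abs(state)


estimate = rollout_return(toy_model, toy_policy, toy_reward, s0=1.0, horizon=3)
```

With these toy functions the simulated states are 0.5, 0.25, and 0.125, so the estimate is the negative sum of their magnitudes. Because the real environment is never queried, the same function can score any candidate policy using only the learned model.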
Abstract
This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark. The model and the policy to be optimized are each implemented as variational quantum circuits. The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function. This model-based approach allows, in principle, full realization on a quantum computer during the optimization phase and gives hope that a quantum advantage can be achieved as soon as sufficiently powerful quantum computers are available.
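As a rough illustration of gradient-free policy optimization with a fitness function, the following Python sketch implements a basic particle swarm optimizer. It is a generic textbook variant under stated assumptions, not the configuration used in the paper: the function name `pso_maximize`, the hyperparameter values, the search bounds, and the quadratic `toy_fitness` are all hypothetical. In the described method, the fitness would instead be the model-estimated return of the policy VQC for a given parameter vector.

```python
import random
from typing import Callable, List, Tuple


def pso_maximize(fitness: Callable[[List[float]], float], dim: int,
                 n_particles: int = 20, n_iters: int = 200,
                 bounds: Tuple[float, float] = (-5.0, 5.0),
                 inertia: float = 0.7, c_personal: float = 1.5,
                 c_global: float = 1.5, seed: int = 0
                 ) -> Tuple[List[float], float]:
    """Maximize `fitness` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen per particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                # Move and clip to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


# Toy fitness (assumption): a stand-in for the model-estimated return,
# maximized at the parameter vector (1, -2).
def toy_fitness(params: List[float]) -> float:
    return -((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)


best_params, best_fit = pso_maximize(toy_fitness, dim=2)
```

The optimizer only ever calls `fitness`, never its gradient, which is why the fitness evaluation can be swapped for a model rollout without changing the search loop.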
