Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces
Yaqi Duan, Martin J. Wainwright
TL;DR
This paper introduces a stability-based framework for reinforcement learning in continuous state-action spaces and proves fast convergence rates for both offline and online settings by linking value sub-optimality to squared Bellman residuals. Central to the approach are two stability properties, Bellman stability and occupation-measure stability, which control how perturbations in value functions and policies affect downstream updates. Under suitable conditions, these properties yield $O(1/n)$ rates in the offline setting and $O(\log T)$ regret in the online setting. The authors develop a concrete, curvature-driven analysis for linear function approximation, derive fast-rate results for fitted Q-iteration (FQI) with ridge penalties, and discuss what these results imply for the roles of pessimism and optimism, as well as the connection to transfer learning under covariate shift. Together, the results give a principled account of when pessimistic or optimistic strategies are actually necessary, with implications for data-efficient RL in continuous domains.
Abstract
We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.
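To make the linear function approximation setting concrete, the following is a minimal sketch of ridge-penalized fitted Q-iteration in Python with NumPy. It is an illustration under stated assumptions, not the authors' algorithm: the feature map `features`, the function name `ridge_fqi`, the scalar state and action, the `lam * n` scaling of the ridge penalty, and the finite `action_grid` (used here to approximate the maximization over a continuous action set) are all hypothetical choices made for this example.

```python
import numpy as np


def features(s, a):
    # Hypothetical feature map: quadratic features of a scalar state and action.
    return np.array([1.0, s, a, s * a, s**2, a**2])


def ridge_fqi(S, A, R, S_next, action_grid, gamma=0.9, lam=1e-3, n_iters=50):
    """Fitted Q-iteration with a linear model Q(s, a) = features(s, a) @ w.

    S, A, R, S_next are 1-D arrays of observed transitions (s, a, r, s').
    The max over actions is approximated on a finite grid (an assumption).
    """
    Phi = np.array([features(s, a) for s, a in zip(S, A)])  # shape (n, d)
    n, d = Phi.shape
    # Features of each next state paired with each candidate action: (n, k, d).
    Phi_next = np.array(
        [[features(s2, a2) for a2 in action_grid] for s2 in S_next]
    )
    # Ridge-regularized Gram matrix; it does not change across iterations.
    gram = Phi.T @ Phi + lam * n * np.eye(d)
    w = np.zeros(d)
    for _ in range(n_iters):
        q_next = Phi_next @ w                       # shape (n, k)
        targets = R + gamma * q_next.max(axis=1)    # Bellman regression targets
        w = np.linalg.solve(gram, Phi.T @ targets)  # ridge regression step
    return w
```

Each iteration regresses the Bellman targets onto the features, so the quantity being driven down is an empirical squared Bellman residual, which is the object the abstract's fast rates are stated in terms of. With `gamma=0` the loop reduces to a single ridge regression of rewards on features, which is a convenient sanity check.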
