Offline Reinforcement Learning in Large State Spaces: Algorithms and Guarantees
Nan Jiang, Tengyang Xie
TL;DR
The article develops the theory of offline reinforcement learning in large state spaces, showing how data coverage and the expressivity of function approximation determine learnability and the guarantees available. It surveys three core strands: (i) value-function-based methods (FQE/BRM) under Bellman completeness and state-action coverage; (ii) pessimistic algorithms (PSPI, PEVI) that relax all-policy coverage to single-policy coverage via uncertainty quantification; and (iii) realizability-driven approaches (BVFT, LSTDQ) together with MIS-based density-ratio methods for OPE and policy optimization. A unifying theme is translating Bellman errors across data distributions using coverage notions (C_π, C^sq_π, C^avg_π), and designing oracle-efficient or otherwise computationally tractable schemes that yield non-vacuous guarantees from offline data alone. The paper also connects to model-based offline RL, hybrid online-offline settings, and partial observability, offering a roadmap for offline learning in real-world problems with large state spaces. Overall, it clarifies when and how offline RL can learn policies reliably from limited or biased data, and points to open problems in deep, structured, and multi-agent settings.
Abstract
This article introduces the theory of offline reinforcement learning in large state spaces, where good policies are learned from historical data without online interactions with the environment. Key concepts introduced include expressivity assumptions on function approximation (e.g., Bellman completeness vs. realizability) and data coverage (e.g., all-policy vs. single-policy coverage). A rich landscape of algorithms and results is described, depending on the assumptions one is willing to make and the sample and computational complexity guarantees one wishes to achieve. We also discuss open questions and connections to adjacent areas.
