Towards Optimal Offline Reinforcement Learning
Mengmeng Li, Daniel Kuhn, Tobias Sutter
TL;DR
This work tackles offline reinforcement learning with an infinite-horizon, long-run average reward objective, using a single trajectory of serially correlated data. It builds a distributionally robust framework in which a large deviations rate function defines a non-rectangular uncertainty set for the state-action-next-state distribution, and it introduces a distribution shift transformation that maps distributions in this set to the regime of the evaluation policy. The authors prove that the worst-case average reward over the shifted set is a statistically efficient estimator whose out-of-sample disappointment decays exponentially, and they develop an actor-critic-style algorithm to solve the resulting robust MDP. They evaluate the approach on off-policy evaluation in GridWorld and on an offline machine replacement planning problem, reporting performance competitive with or superior to existing baselines, especially in finite-data regimes. The work offers a principled, statistically efficient route to offline RL under distribution shift and non-rectangular uncertainty, with tractable approximate algorithms for robust offline optimization.
Abstract
We study offline reinforcement learning problems with a long-run average reward objective. The state-action pairs generated by any fixed behavioral policy follow a Markov chain, and the empirical state-action-next-state distribution satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution. We also construct a distribution shift transformation that maps any distribution in this uncertainty set to a state-action-next-state distribution of the Markov chain generated by a fixed evaluation policy, which may differ from the unknown behavioral policy. We prove that the worst-case average reward of the evaluation policy with respect to all distributions in the shifted uncertainty set provides, in a rigorous statistical sense, the least conservative estimator for the average reward under the unknown true distribution. This guarantee is available even if one only has access to a single trajectory of serially correlated state-action pairs. The emerging robust optimization problem can be viewed as a robust Markov decision process with a non-rectangular uncertainty set. We adapt an efficient policy gradient algorithm to solve this problem. Numerical experiments show that our methods compare favorably against state-of-the-art methods.
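To make the first two ingredients concrete, the following minimal sketch computes the empirical state-action-next-state distribution from a single trajectory and tests whether a candidate model lies in a rate-function ball around it. It assumes finite state and action spaces and takes the rate function to be the conditional relative entropy, which is the standard large deviations rate for pair empirical measures of finite Markov chains; the abstract does not spell out the exact form used in the paper. The function names and the `radius` parameter are hypothetical, and the distribution shift and the worst-case reward computation are not shown.

```python
from collections import Counter
from math import log


def empirical_triple_distribution(trajectory):
    """Empirical distribution over (state, action, next_state) triples.

    `trajectory` is a list of (state, action) pairs from one rollout.
    """
    triples = [
        (s, a, s_next)
        for (s, a), (s_next, _) in zip(trajectory, trajectory[1:])
    ]
    counts = Counter(triples)
    total = len(triples)
    return {triple: c / total for triple, c in counts.items()}


def conditional(dist):
    """Conditional probabilities p(s' | s, a) implied by a triple distribution."""
    marginal = Counter()
    for (s, a, _), p in dist.items():
        marginal[(s, a)] += p
    return {
        (s, a, s_next): p / marginal[(s, a)]
        for (s, a, s_next), p in dist.items()
        if p > 0.0
    }


def conditional_relative_entropy(candidate, reference):
    """Sum over triples of candidate(s,a,s') * log(candidate(s'|s,a) / reference(s'|s,a)).

    Assumed form of the rate function (not confirmed by the abstract).
    Returns infinity if `candidate` puts mass where `reference` has none.
    """
    cand_cond = conditional(candidate)
    ref_cond = conditional(reference)
    total = 0.0
    for triple, p in candidate.items():
        if p == 0.0:
            continue
        q = ref_cond.get(triple, 0.0)
        if q == 0.0:
            return float("inf")
        total += p * log(cand_cond[triple] / q)
    return total


def in_uncertainty_set(empirical, model, radius):
    """True if `model` is within `radius` of the empirical distribution."""
    return conditional_relative_entropy(empirical, model) <= radius


# Toy usage: a short trajectory and one candidate model.
trajectory = [(0, "a"), (1, "b"), (0, "a"), (1, "b"), (0, "a")]
theta_hat = empirical_triple_distribution(trajectory)
model = {(0, "a", 1): 0.25, (0, "a", 0): 0.25, (1, "b", 0): 0.5}
divergence = conditional_relative_entropy(theta_hat, model)
```

In this toy example the empirical distribution always moves from `(0, "a")` to state 1, while the candidate model does so with probability 0.5, so the divergence is `0.5 * log(2)`. A larger `radius` admits more candidate models and yields a more conservative worst-case estimate.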
