Towards Optimal Offline Reinforcement Learning

Mengmeng Li, Daniel Kuhn, Tobias Sutter

TL;DR

This work tackles offline reinforcement learning with an infinite-horizon, long-run average reward objective using a single trajectory of serially correlated data. It builds a distributionally robust framework that uses a large deviations rate function to construct a non-rectangular uncertainty set for the state-action-next-state distribution, and it introduces a distribution shift transformation that maps distributions in this set to the regime of the evaluation policy. The authors prove that the worst-case average reward over the shifted set is a statistically efficient estimator whose out-of-sample disappointment decays exponentially, and they develop a practical actor-critic-style algorithm to solve the resulting robust MDP. They evaluate the approach on a GridWorld off-policy evaluation (OPE) task and a machine replacement offline planning problem, where it performs competitively with or better than existing baselines, especially in finite-data regimes. Overall, the work offers a principled, statistically efficient route to offline RL under distribution shift and non-rectangular uncertainty, together with tractable approximate algorithms for robust offline optimization.

Abstract

We study offline reinforcement learning problems with a long-run average reward objective. The state-action pairs generated by any fixed behavioral policy thus follow a Markov chain, and the empirical state-action-next-state distribution satisfies a large deviations principle. We use the rate function of this large deviations principle to construct an uncertainty set for the unknown true state-action-next-state distribution. We also construct a distribution shift transformation that maps any distribution in this uncertainty set to a state-action-next-state distribution of the Markov chain generated by a fixed evaluation policy, which may differ from the unknown behavioral policy. We prove that the worst-case average reward of the evaluation policy with respect to all distributions in the shifted uncertainty set provides, in a rigorous statistical sense, the least conservative estimator for the average reward under the unknown true distribution. This guarantee holds even if one only has access to a single trajectory of serially correlated state-action pairs. The emerging robust optimization problem can be viewed as a robust Markov decision process with a non-rectangular uncertainty set. We adapt an efficient policy gradient algorithm to solve this problem. Numerical experiments show that our methods compare favorably against state-of-the-art methods.
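To make the two central objects of the abstract concrete, the following Python sketch computes an empirical state-action-next-state distribution from a single trajectory and a conditional relative entropy between two such distributions. It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, and the entropy formula is the standard conditional form for Markov chains, which may differ in detail from the paper's Definitions 2.1 and 2.6.

```python
from collections import Counter
from math import log


def empirical_triple_distribution(trajectory):
    """Relative frequencies of (state, action, next_state) triples.

    `trajectory` is a list of (state, action) pairs observed along one
    single run of the behavioral policy (hypothetical input format).
    """
    counts = Counter()
    for (s, a), (s_next, _) in zip(trajectory, trajectory[1:]):
        counts[(s, a, s_next)] += 1
    total = sum(counts.values())
    return {triple: c / total for triple, c in counts.items()}


def _state_action_marginal(mu):
    """Sum a triple distribution over the next state."""
    marginal = Counter()
    for (s, a, _), p in mu.items():
        marginal[(s, a)] += p
    return marginal


def conditional_relative_entropy(mu_alt, mu_ref):
    """Sum over triples of mu_alt(s,a,s') * log(P_alt(s'|s,a) / P_ref(s'|s,a)).

    Assumed standard form; returns infinity if mu_alt puts mass on a
    transition that mu_ref rules out.
    """
    m_alt = _state_action_marginal(mu_alt)
    m_ref = _state_action_marginal(mu_ref)
    total = 0.0
    for (s, a, s_next), p in mu_alt.items():
        if p == 0.0:
            continue
        q = mu_ref.get((s, a, s_next), 0.0)
        if q == 0.0:
            return float("inf")
        total += p * log((p / m_alt[(s, a)]) / (q / m_ref[(s, a)]))
    return total
```

In this sketch, an uncertainty set of the kind described in the abstract would collect all candidate distributions whose conditional relative entropy with respect to the empirical distribution stays below a chosen radius. Which argument plays the role of the reference, and how the radius is selected, are left open here.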

Paper Structure

This paper contains 16 sections, 26 theorems, 91 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Theorem 2.2

For all $\theta \in \Theta_0$ and Borel sets $\mathcal{D} \subseteq \Theta$, the empirical doublet distribution $\widehat{\theta}_{T}$ defined in MC:estimator satisfies

Figures (2)

  • Figure 1: Scatter plot of the average reward predicted by different estimators against the empirical out-of-sample disappointment $\widehat{\beta}$. Points correspond to different hyperparameter values from Table 1.
  • Figure 2: Frequencies at which each of the three policy estimators achieves the highest long-run average reward across $100$ independent simulation runs, as a function of $T$.

Theorems & Definitions (33)

  • Definition 2.1: Conditional relative entropy for Markov chains
  • Theorem 2.2: Large deviations principle for Markov chains
  • Corollary 2.3: Finite-sample version of Theorem 2.2
  • Proposition 2.4: State-action process $\{X_t\}_{t=1}^\infty$
  • Lemma 2.5
  • Definition 2.6: Conditional relative entropy for MDPs
  • Proposition 2.7: Relation between $\mathsf{D}_{\mathsf{mc}}$ and $\mathsf{D}_{\mathsf{mdp}}$
  • Theorem 2.8: Large deviations principle for MDPs
  • Corollary 2.9: Finite-sample version of Theorem 2.8
  • Lemma 2.10: Properties of $\mathsf{D}_{\mathsf{mdp}}$
  • ...and 23 more