Achieving the Asymptotically Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach

Yue Wang, Jinjun Xiong, Shaofeng Zou

TL;DR

This paper directly models the uncertainty in the transition kernel, constructs an uncertainty set of statistically plausible transition kernels, and shows that the policy optimizing the worst-case performance over this set achieves near-optimal performance in the underlying problem.

Abstract

Offline reinforcement learning aims to learn from pre-collected datasets without active exploration. This problem faces significant challenges, including limited data availability and distributional shifts. Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively. In this paper, we show that the distributionally robust optimization (DRO) based approach can also address these challenges and is asymptotically minimax optimal. Specifically, we directly model the uncertainty in the transition kernel and construct an uncertainty set of statistically plausible transition kernels. We then show that the policy that optimizes the worst-case performance over this uncertainty set has near-optimal performance in the underlying problem. We first design a metric-based distribution-based uncertainty set such that with high probability the true transition kernel is in this set. We prove that to achieve a sub-optimality gap of $\epsilon$, the sample complexity is $\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the single-policy clipped concentrability coefficient, which quantifies the distribution shift. To achieve the optimal sample complexity, we further propose a less conservative value-function-based uncertainty set, which, however, does not necessarily include the true transition kernel. We show that an improved sample complexity of $\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which asymptotically matches the minimax lower bound for offline reinforcement learning and is thus asymptotically minimax optimal.
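
To see how much the value-function-based set saves, the two bounds stated in the abstract can be divided. Constants and logarithmic factors hidden by the $\mathcal{O}(\cdot)$ notation are ignored, so this is only an order-of-magnitude comparison:

$$\frac{S^2 C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4}}{S\,C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3}} = \frac{S}{1-\gamma}.$$

For instance, with $S=100$ and $\gamma=0.99$ (illustrative values, not taken from the paper), the improvement factor is $100/0.01 = 10^4$.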

Paper Structure

This paper contains 19 sections, 13 theorems, 116 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Lemma 1

With probability at least $1-\delta$, it holds that for any $s,a$, $\mathsf P^a_s\in\hat{\mathcal{P}}^a_s$, i.e., $\|\mathsf P^a_s-\hat{\mathsf P}^a_s\|\leq 2R^a_s$. (In this paper, unless stated otherwise, $\|\cdot\|$ denotes the $l_1$-norm.)
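
A minimal sketch of how an $l_1$-ball uncertainty set of this kind might be used in practice. It estimates an empirical next-state distribution, checks membership in the ball, and computes the worst-case expected value over the ball with a standard greedy mass-shifting argument. The function names and the numeric values are hypothetical, and `radius` is passed in as a free parameter because this summary does not give the formula for $R^a_s$. This is not the paper's algorithm, only an illustration of the worst-case step for a single state-action pair.

```python
import numpy as np

def empirical_kernel(next_states, num_states):
    """Empirical next-state distribution from observed next-state indices."""
    counts = np.bincount(next_states, minlength=num_states)
    return counts / counts.sum()

def in_l1_ball(p, p_hat, radius):
    """Return True if ||p - p_hat||_1 <= radius (small float tolerance)."""
    return float(np.abs(p - p_hat).sum()) <= radius + 1e-12

def worst_case_value(p_hat, v, radius):
    """Minimize q @ v over distributions q with ||q - p_hat||_1 <= radius.

    Greedy solution: move up to radius/2 probability mass onto the
    lowest-value state, taking it from the highest-value states first.
    """
    q = np.asarray(p_hat, dtype=float).copy()
    lo = int(np.argmin(v))
    budget = min(radius / 2.0, 1.0 - q[lo])  # mass that can be shifted
    q[lo] += budget
    for s in np.argsort(v)[::-1]:  # highest-value states first
        if budget <= 0:
            break
        if s == lo:
            continue
        take = min(q[s], budget)
        q[s] -= take
        budget -= take
    return float(q @ v), q

# Hypothetical data: ten observed transitions from one (s, a) pair.
samples = np.array([0, 0, 1, 2, 0, 1, 0, 0, 1, 2])
p_hat = empirical_kernel(samples, num_states=3)
v = np.array([1.0, 0.0, 2.0])   # assumed value estimates
radius = 0.2                    # assumed radius, stands in for the paper's bound

nominal = float(p_hat @ v)
robust, q_worst = worst_case_value(p_hat, v, radius)
print(nominal, robust, in_l1_ball(q_worst, p_hat, radius))
```

The robust value is never larger than the nominal one, which is the source of pessimism in the DRO-based approach: state-action pairs with few samples would get a larger radius and hence a more conservative estimate.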

Figures (1)

  • Figure 1: Sub-optimality gaps of Robust DP, LCB approach, and Non-robust DP.

Theorems & Definitions (25)

  • Definition 1
  • Lemma 1
  • Theorem 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 2
  • Theorem 3
  • proof
  • Lemma 2
  • ...and 15 more