On the Statistical Complexity for Offline and Low-Adaptive Reinforcement Learning with Structures

Ming Yin; Mengdi Wang; Yu-Xiang Wang

TL;DR

This article reviews recent advances in the statistical foundations of reinforcement learning (RL) in the offline and low-adaptive settings, and covers the key algorithmic ideas and proof techniques behind near-optimal instance-dependent methods for offline policy evaluation (OPE) and offline policy learning (OPL).

Abstract

This article reviews recent advances in the statistical foundations of reinforcement learning (RL) in the offline and low-adaptive settings. We start by arguing why offline RL is the appropriate model for almost any real-life ML problem, even one that has nothing to do with the recent AI breakthroughs that use RL. We then zoom in on two fundamental problems of offline RL: offline policy evaluation (OPE) and offline policy learning (OPL). It may be surprising that tight bounds for these problems were not known, even in the tabular and linear cases, until recently. We delineate the differences between worst-case minimax bounds and instance-dependent bounds. We also cover the key algorithmic ideas and proof techniques behind near-optimal instance-dependent methods in OPE and OPL. Finally, we discuss the limitations of offline RL and review the burgeoning problem of low-adaptive exploration, which addresses these limitations by providing a sweet middle ground between offline and online RL.

Paper Structure (27 sections, 12 theorems, 75 equations, 4 figures, 2 tables)


Introduction
Notations and problem setup
Episodic time-inhomogeneous RL
Structured MDP models
Offline RL Tasks
Assumptions in offline RL
Offline Policy Evaluation in Contextual Bandits and Tabular RL
OPE in contextual bandits
"Curse of Horizon" in OPE for RL
OPE in Tabular MDPs
Offline Policy Evaluation with function approximation
Linear function approximation
Parametric function approximation
Offline Policy Learning in Tabular RL: Pessimism and Instance-Dependent Bounds
Pessimism is Minimax Optimal
...and 12 more sections

Key Result

Theorem 3.1

For discrete DAG MDPs with horizon $H$, the variance of any unbiased estimator $\hat{v}$ with $n$ trajectories from policy $\mu$ satisfies

Figures (4)

Figure 1: Illustration of the offline reinforcement learning problem.
Figure 2: Adopted from yin2020asymptotically. Scaling of the relative RMSE ($\sqrt{\text{MSE}}/v^\pi$) of TMIS, SMIS and IS on a time-inhomogeneous MDP. Right panel: as the number of episodes $n$ grows, both TMIS and SMIS converge at a rate of $n^{-1/2}$. Left panel: as the horizon $H$ grows, the MSE of TMIS has the optimal dependence $O(H^2)$, while the MSE of SMIS has the dependence $O(H^3)$.
Figure 3: An instance of a multi-armed bandit (MAB) problem in which pessimism is the right choice. Red choice: upper confidence bound. Blue choice: empirical risk minimizer. Green choice: lower confidence bound \ref{eqn:lcb_mab}. The red star denotes the true mean reward; the black cross denotes the point estimator \ref{eqn:erm}.
Figure 4: Illustration of the problem of low-adaptive RL.

Theorems & Definitions (14)

Remark 1
Theorem 3.1: Cramer-Rao lower bound for tabular OPE jiang2016doubly
Theorem 3.2
Lemma 3.3
Theorem 3.4
Theorem 4.1
Theorem 4.2
Theorem 5.1
Theorem 6.1
Theorem 6.2: yin2022offline
...and 4 more
