Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

Sunil Madhow; Dan Qiao; Ming Yin; Yu-Xiang Wang

TL;DR

This paper addresses offline policy evaluation when data are collected adaptively, introducing Adaptive Offline Policy Evaluation (AOPE) as a generalization of traditional OPE for tabular MDPs. It analyzes the TMIS estimator under adaptive data, develops a tape-based model to handle dependencies, and derives high-probability instance-dependent bounds as well as a lower bound under adaptive exploration. The results demonstrate that while adaptive data can degrade performance in some regimes, minimax-like rates from the non-adaptive setting can be ported under reasonable exploration assumptions, and the bounds offer fine-grained insights into which regions of the MDP drive estimation error. Empirical simulations with adaptive logging reveal potential biases and context-dependent benefits, supporting the practical relevance of AOPE and guiding data-collection design for offline RL.

Abstract

Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution, namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.
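To make the object of study concrete, here is a minimal sketch of a tabular plug-in value estimator in the spirit of TMIS. It is an illustration under assumptions, not the authors' implementation: the function name `plug_in_value`, the `(s, a, r, s_next)` tuple layout, and the 0-indexed time steps are all hypothetical choices. It estimates transitions and rewards from counts, propagates the estimated state distribution forward under a deterministic policy, and treats unvisited state-action pairs as contributing nothing.

```python
from collections import defaultdict


def plug_in_value(trajectories, policy, horizon):
    """Plug-in value estimate for a deterministic policy (illustrative sketch).

    trajectories: list of trajectories, each a list of `horizon` tuples
                  (s, a, r, s_next); steps are 0-indexed here (an assumption).
    policy:       policy[h][s] is the action taken at step h in state s.
    """
    n_sa = defaultdict(int)                          # visit counts n_{h,s,a}
    r_sum = defaultdict(float)                       # summed rewards at (h, s, a)
    next_counts = defaultdict(lambda: defaultdict(int))  # counts of s' per (h, s, a)
    init = defaultdict(int)                          # initial-state counts

    for traj in trajectories:
        init[traj[0][0]] += 1
        for h, (s, a, r, s_next) in enumerate(traj):
            n_sa[(h, s, a)] += 1
            r_sum[(h, s, a)] += r
            next_counts[(h, s, a)][s_next] += 1

    n = len(trajectories)
    d = {s: c / n for s, c in init.items()}          # estimated initial distribution
    value = 0.0
    for h in range(horizon):
        d_next = defaultdict(float)
        for s, mass in d.items():
            a = policy[h][s]
            count = n_sa.get((h, s, a), 0)
            if count == 0:
                continue  # unvisited pair: contributes nothing (0/0 = 0 convention)
            value += mass * r_sum[(h, s, a)] / count
            for s_next, c in next_counts[(h, s, a)].items():
                d_next[s_next] += mass * c / count   # push mass through estimated transitions
        d = dict(d_next)
    return value


# Tiny hand-made dataset: two states, two actions, horizon 2.
toy_data = [
    [(0, 0, 1.0, 1), (1, 0, 0.5, 1)],
    [(0, 0, 1.0, 1), (1, 0, 0.5, 1)],
    [(0, 1, 0.0, 0), (0, 1, 2.0, 0)],
]
always_0 = [{0: 0, 1: 0}, {0: 0, 1: 0}]
always_1 = [{0: 1, 1: 1}, {0: 1, 1: 1}]
v0 = plug_in_value(toy_data, always_0, horizon=2)
v1 = plug_in_value(toy_data, always_1, horizon=2)
```

Nothing in the function requires the trajectories to be i.i.d.; the adaptive setting changes the statistical analysis of the estimate, not how it is computed.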

Paper Structure (27 sections, 10 theorems, 47 equations, 7 figures)

Introduction
Related Work
Novel Contributions
Preliminaries
Symbols, notation, and MDP basics.
Motivation and Problem Setup
Our estimator
Theoretical Results
Proof sketches
Lower Bound
Empirical Results
Experimental Motivation and Design
Results
Conclusion and Future Work
Concentration inequalities
...and 12 more sections

Key Result

Theorem 3.1

Suppose $\mathcal{D}$ is a dataset conforming to AOPE, and $\hat{v}^\pi$ is formed using this dataset. Then, with probability at least $1 - \delta$, the following holds for all deterministic policies $\pi$: where $n_{h, s, a}$ is the number of occurrences of $(s_h, a_h)$ in $\mathcal{D}$ and with the convention that $\frac{0}{0} = 0$.
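The two bookkeeping ingredients named in the statement, the visitation counts $n_{h, s, a}$ and the convention $\frac{0}{0} = 0$, can be sketched as follows. This is only an illustration: the helper names `visitation_counts` and `ratio`, and the representation of a trajectory as a list of `(s, a)` pairs with 0-indexed steps, are assumptions rather than anything taken from the paper.

```python
from collections import Counter


def visitation_counts(dataset):
    """Return n_{h,s,a}: how often (s, a) occurs at step h in the dataset."""
    counts = Counter()
    for traj in dataset:
        for h, (s, a) in enumerate(traj):
            counts[(h, s, a)] += 1
    return counts


def ratio(numerator, denominator):
    """Division that follows the convention 0/0 = 0."""
    if numerator == 0 and denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical dataset of three trajectories with horizon 2.
example = [
    [("s0", "a0"), ("s1", "a1")],
    [("s0", "a0"), ("s1", "a0")],
    [("s0", "a1"), ("s0", "a1")],
]
n_hsa = visitation_counts(example)
```

A `Counter` returns 0 for keys it has never seen, so unvisited triples need no special handling before being passed to `ratio`.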

Figures (7)

Figure 1: An illustration of the tape view of adaptive data collection. Each row $(h, s, a)$ should be thought to contain $n$ i.i.d. samples from $P_{h + 1}(\cdot | s, a)$. The red "frontier" tracks how many samples have, in fact, been used by the logger (this quantity is always bounded by $n$).
Figure 2: Non-adaptive regime (left) versus adaptive regime (right), depicted as a graphical model. We see that, in the adaptive regime, each policy depends on all previous trajectories. This induces dependence between the trajectories.
Figure 3: For different $\pi$, the blue curves show the average value of $\sqrt{n}\times(\hat{v}^\pi - v^\pi)$, where $\hat{v}^\pi$ is computed on the first $n \leq N$ trajectories of each of the $10{,}000$ adaptive datasets. The orange curves are computed in the same way, except using the $10{,}000$ shadow datasets. On the left-hand side, $\pi$ is very suboptimal. On the right-hand side, $\pi$ is optimal. Confidence intervals are 95% Gaussian.
Figure 4: A reproduction of Figure 1. Each row $(h, s, a)$ should be thought to contain $n$ i.i.d. samples from $P_{h + 1}(\cdot | s, a)$. The red "frontier" tracks how many samples have, in fact, been used by the logger (this quantity is always bounded by $n$).
Figure 5: Branching MDP construction used in the lower bound
...and 2 more figures
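The tape view described in the Figure 1 caption can be simulated directly. The sketch below is a hedged illustration, not the paper's construction: the function `collect_with_tapes`, the fixed initial state, the toy dynamics in `sample_next`, and the `least_tried` logging rule are all hypothetical. Each row $(h, s, a)$ gets $n$ pre-drawn next-state samples, and the logger, whose choices may depend on all earlier trajectories, only advances a per-row frontier.

```python
import random


def collect_with_tapes(n, horizon, states, actions, sample_next, choose_action, seed=0):
    """Collect n trajectories by reading pre-drawn samples off per-row tapes."""
    rng = random.Random(seed)
    # Pre-draw n i.i.d. next-state samples for every row (h, s, a).
    tapes = {
        (h, s, a): [sample_next(h, s, a, rng) for _ in range(n)]
        for h in range(horizon) for s in states for a in actions
    }
    frontier = {key: 0 for key in tapes}  # samples consumed so far in each row
    data = []
    for _ in range(n):
        s = states[0]  # hypothetical fixed initial state
        traj = []
        for h in range(horizon):
            a = choose_action(h, s, data, rng)  # may depend on all earlier trajectories
            key = (h, s, a)
            s_next = tapes[key][frontier[key]]  # read the next unused sample
            frontier[key] += 1                  # advance the frontier for this row
            traj.append((s, a, s_next))
            s = s_next
        data.append(traj)
    return data, frontier


def sample_next(h, s, a, rng):
    # Toy dynamics (assumption): action 1 moves to state 1 with probability 0.8.
    return 1 if rng.random() < (0.8 if a == 1 else 0.2) else 0


def least_tried(h, s, data, rng):
    # Adaptive logger (assumption): pick the action tried least often at (h, s).
    counts = {0: 0, 1: 0}
    for traj in data:
        if traj[h][0] == s:
            counts[traj[h][1]] += 1
    return min(counts, key=counts.get)


data, frontier = collect_with_tapes(50, 3, [0, 1], [0, 1], sample_next, least_tried)
```

Because each trajectory reads exactly one sample per step $h$, no frontier can exceed $n$, which matches the bound stated in the caption.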

Theorems & Definitions (12)

Definition 2.1: Adaptive Offline Policy Evaluation (AOPE)
Theorem 3.1: High-probability uniform bound on estimation error in AOPE
Corollary 3.2: High-probability uniform bound on estimation error in AOPE
Theorem 3.3: Instance-dependent pointwise bound on estimation error in AOPE
Corollary 3.4: Worst-case pointwise bound on estimation error in AOPE
Theorem 3.5
Lemma 6.1: Hoeffding's Inequality
Lemma 6.2: Bernstein's Inequality
Lemma 6.3: d-dimensional Concentration
Remark 11.1
...and 2 more
