Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data
Sunil Madhow, Dan Qiao, Ming Yin, Yu-Xiang Wang
TL;DR
This paper studies offline policy evaluation (OPE) when data are collected adaptively, introducing Adaptive Offline Policy Evaluation (AOPE) as a generalization of traditional OPE for tabular MDPs. It analyzes the Tabular Marginalized Importance Sampling (TMIS) estimator under adaptive data, develops a tape-based model to handle the dependencies that adaptive collection introduces, and derives high-probability, instance-dependent bounds together with a lower bound under adaptive exploration. The results show that adaptive data can degrade performance in some regimes, but that minimax-like rates from the non-adaptive setting carry over under reasonable exploration assumptions. The bounds also give fine-grained insight into which regions of the MDP drive estimation error. Simulations with adaptive logging reveal potential biases and context-dependent benefits, supporting the practical relevance of AOPE and informing data-collection design for offline RL.
Abstract
Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Currently, most results hinge on unrealistic assumptions about the data distribution -- namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We consider a more general setting where the dataset may have been gathered adaptively. We develop theory for the TMIS Offline Policy Evaluation (OPE) estimator in this generalized setting for tabular MDPs, deriving high-probability, instance-dependent bounds on its estimation error. We also recover minimax-optimal offline learning in the adaptive setting. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive regimes.
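To make the object of study concrete, here is a minimal sketch of a tabular plug-in estimator of the kind TMIS is usually described as: empirical counts give estimated transitions and rewards, and the target policy's marginal state distribution is propagated forward through them. The function name `tmis_estimate`, the trajectory format, and the convention of treating unvisited state-action pairs as zero are assumptions made for illustration, not the paper's exact definition. The computation is the same whether the trajectories were logged adaptively or not; only the analysis of its error differs.

```python
import numpy as np

def tmis_estimate(trajectories, pi, S, A, H):
    """Plug-in value estimate for a finite-horizon tabular MDP (illustrative).

    trajectories: list of episodes, each a list of H tuples (s, a, r, s_next)
    pi:           array of shape (H, S, A) with target-policy probabilities
    S, A, H:      number of states, number of actions, horizon
    """
    n_sa = np.zeros((H, S, A))       # visit counts per (h, s, a)
    n_sas = np.zeros((H, S, A, S))   # transition counts per (h, s, a, s')
    r_sum = np.zeros((H, S, A))      # summed rewards per (h, s, a)
    d0 = np.zeros(S)                 # initial-state counts

    for episode in trajectories:
        d0[episode[0][0]] += 1
        for h, (s, a, r, s_next) in enumerate(episode):
            n_sa[h, s, a] += 1
            n_sas[h, s, a, s_next] += 1
            r_sum[h, s, a] += r

    # Assumed convention: unvisited (h, s, a) get zero reward and zero
    # outgoing mass, so dividing by max(count, 1) is safe.
    denom = np.maximum(n_sa, 1)
    P_hat = n_sas / denom[..., None]
    r_hat = r_sum / denom

    d = d0 / len(trajectories)       # estimated marginal state distribution
    value = 0.0
    for h in range(H):
        d_sa = d[:, None] * pi[h]                    # marginal over (s, a)
        value += float(np.sum(d_sa * r_hat[h]))      # expected reward at step h
        d = np.einsum("sa,sat->t", d_sa, P_hat[h])   # push forward one step
    return value
```

With two logged episodes over two states and two actions, for example, calling `tmis_estimate(data, pi, S=2, A=2, H=2)` with a uniform `pi` averages the estimated rewards along every state-action path the counts support.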
