On The Statistical Complexity of Offline Decision-Making
Thanh Nguyen-Tang, Raman Arora
TL;DR
This work develops a unified minimax theory for offline decision-making with function approximation. It identifies two quantities that govern learnability: the pseudo-dimension of the value-function class and a new family of policy transfer coefficients, which subsume prior notions of data coverage. Using these, the authors derive near-optimal lower and upper bounds for offline contextual bandits and MDPs, and extend the analysis to a hybrid offline-online setting with adaptive, Hedge-based procedures. The results characterize when offline data can accelerate online decision-making, and the proposed algorithms, OfDM-Hedge and OfDM-Hedge-MDP, are distribution-shift-aware and adapt to unknown transfer regimes. The paper also outlines open gaps and directions for future work, including nonparametric settings and fully adaptive hybrid procedures.
Abstract
We study the statistical complexity of offline decision-making with function approximation, establishing (near) minimax-optimal rates for stochastic contextual bandits and Markov decision processes. The performance limits are captured by the pseudo-dimension of the (value) function class and a new characterization of the behavior policy that strictly subsumes all the previous notions of data coverage in the offline decision-making literature. In addition, we seek to understand the benefits of using offline data in online decision-making and show nearly minimax-optimal rates in a wide range of regimes.
