Foundations of Reinforcement Learning and Interactive Decision Making
Dylan J. Foster, Alexander Rakhlin
TL;DR
The notes present a unifying statistical framework for interactive decision making, spanning multi-armed bandits (MAB), contextual and structured bandits, and reinforcement learning with function approximation. They develop minimax and online-to-batch perspectives, introduce standard algorithms (Exponential Weights, UCB, Exp3, Posterior Sampling), and cover advances such as Inverse Gap Weighting (IGW) and the Estimation-to-Decisions (E2D) framework, which is built on the Decision-Estimation Coefficient (DEC), to quantify and guide exploration. A central contribution is the DEC, which links regret to information gain and generalizes across problem classes, enabling instance-dependent guarantees via SquareCB, IGW, and eluder-dimension concepts. The material emphasizes sample efficiency and the role of structure (linear models, Lipschitz spaces, generalized linear models) in enabling scalable generalization across contexts and decisions, including extensions to offline and misspecified settings.
Abstract
These lecture notes give a statistical perspective on the foundations of reinforcement learning and interactive decision making. We present a unifying framework for addressing the exploration-exploitation dilemma using frequentist and Bayesian approaches, with connections and parallels between supervised learning/estimation and decision making as an overarching theme. Special attention is paid to function approximation and flexible model classes such as neural networks. Topics covered include multi-armed and contextual bandits, structured bandits, and reinforcement learning with high-dimensional feedback.
