Foundations of Reinforcement Learning and Interactive Decision Making

Dylan J. Foster; Alexander Rakhlin

TL;DR

The notes present a unifying statistical framework for interactive decision making, spanning multi-armed bandits, contextual and structured bandits, and reinforcement learning with function approximation. They develop minimax and online-to-batch perspectives, introduce online learning and bandit algorithms (Exponential Weights, UCB, Exp3, Posterior Sampling), and cover more recent advances such as Inverse Gap Weighting (IGW) and the Estimation-to-Decisions (E2D) framework for quantifying and guiding exploration. A central contribution is the Decision-Estimation Coefficient (DEC), which links regret to information gain and generalizes across problem classes, enabling instance-dependent guarantees via SquareCB, IGW, and the eluder dimension. The material emphasizes sample efficiency and the role of structure (linear models, Lipschitz spaces, generalized linear models) in enabling generalization across contexts and decisions, including extensions to offline and misspecified settings.

Abstract

These lecture notes give a statistical perspective on the foundations of reinforcement learning and interactive decision making. We present a unifying framework for addressing the exploration-exploitation dilemma using frequentist and Bayesian approaches, with connections and parallels between supervised learning/estimation and decision making as an overarching theme. Special attention is paid to function approximation and flexible model classes such as neural networks. Topics covered include multi-armed and contextual bandits, structured bandits, and reinforcement learning with high-dimensional feedback.

Paper Structure (147 sections, 1 theorem, 635 equations, 10 figures)

Introduction
Decision Making
A Spectrum of Decision Making Problems
Minimax Perspective
Statistical Learning: Brief Refresher
Empirical risk minimization and excess risk
Connection to estimation
Guarantees for ERM
Refresher: Random Variables and Averages
Online Learning and Prediction
Connection to Statistical Learning
The Exponential Weights Algorithm
Exercises
Multi-Armed Bandits
The Need for Exploration
...and 132 more sections

Key Result

theorem 1

For any $\delta>0$, UCB-VI with guarantees that with probability at least $1-\delta$,

Figures (10)

Figure 1: A general decision making problem.
Figure 2: Landscape of decision making problems.
Figure 3: Conditional density estimation.
Figure 4: An illustration of the multi-armed bandit problem. A doctor (the learner) aims to select a treatment (the decision) to improve a patient's vital signs (the reward).
Figure 5: Illustration of the UCB algorithm. Selecting the action $\pi^{t}$ optimistically ensures that the suboptimality never exceeds the confidence width.
...and 5 more figures

Theorems & Definitions (87)

definition 1
proof: Proof of \ref{lem:square_well_specified}
definition 2
proof: Proof of \ref{lem:erm_uniform_dev}
proof: Proof of \ref{prop:online_to_batch}
proof: Proof of \ref{prop:online_bounds}
proof: Proof of \ref{prop:eps_greedy}
proof: Proof of \ref{lem:regret_optimistic}
proof: Proof of \ref{prop:ucb}
proof: Proof of \ref{lem:confidence_width_potential}
...and 77 more
