Statistical Inference in Reinforcement Learning: A Selective Survey

Chengchun Shi

TL;DR

This work surveys the role of statistical inference in reinforcement learning, focusing on hypothesis testing for the Markov property and off-policy confidence interval estimation. It introduces forward-backward generative learning to construct doubly robust tests for conditional independence and Markov assumptions, enabling robust model selection (MDP vs higher-order or POMDP) in offline RL. Across diabetes and Tiger problem case studies, the framework demonstrates practical identification of MDP order and improved policy evaluation under correct model assumptions. The contributions bridge classical statistical tools with RL practice, offering scalable methods for uncertainty quantification and model validation in sequential decision problems.

Abstract

Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to maximize the cumulative reward they receive. In healthcare, applying RL algorithms could help patients improve their health status. On ride-sharing platforms, applying RL algorithms could increase drivers' income and customer satisfaction. For large language models, applying RL algorithms could align their outputs with human preferences. Over the past decade, RL has arguably been one of the most vibrant research frontiers in machine learning. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in both depth and breadth. This chapter presents a selective review of statistical inferential tools for RL, covering both hypothesis testing and confidence interval construction. Our goal is to highlight the value of statistical inference in RL for both the statistics and machine learning communities, and to promote the broader application of classical statistical inference tools in this vibrant area of research.

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Visualizations of (a) sequential decision making; (b) three types of policies.
  • Figure 2: Three RL models. Solid black lines represent causal relationships between variables, while dashed orange lines indicate the historical variables on which the optimal policy $\pi^*$ depends.
  • Figure 3: Left: The tiger problem. Middle: Percentage of rejections of the null hypothesis that the data from the tiger problem satisfies a $k$-th order Markov assumption, using the test of Shi et al. (2020), for $k=1,2,\ldots,10$. $n$ denotes the number of trajectories. Right: Same as Middle, but with the true location of the tiger known and included in the observations.