Central Limit Theorems for Transition Probabilities of Controlled Markov Chains

Ziwei Su, Imon Banerjee, Diego Klabjan

TL;DR

This work develops the first asymptotic normality theory for non-parametric transition-matrix estimation in finite controlled Markov chains under history-dependent logging, enabling principled statistical inference for offline RL. The authors prove a properly scaled central limit theorem for the count-based estimator of transition probabilities, under mild return-time growth and mixing assumptions, and provide a coupling-based proof technique that handles non-stationarity and non-Markovian controls. They further derive goodness-of-fit tests for transition kernels and extend the CLTs to downstream quantities in reinforcement learning, including the value, Q-, and advantage functions, as well as the value of the estimated optimal policy, thereby enabling confidence intervals and hypothesis testing in offline policy evaluation and recovery. The results illuminate when a CLT is possible or impossible, and they offer a unified framework for statistical inference in model-based offline RL across broad logging policies. Together, these contributions deliver a rigorous large-sample toolkit for evaluating and recovering policies from logged data in finite CMCs.

Abstract

We develop a central limit theorem (CLT) for the non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build upon it to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests are derived as a corollary, which enable us to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
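As a rough illustration of the count-based estimator the TL;DR refers to, the sketch below estimates P(s' | s, a) from one logged trajectory by dividing transition counts by state-action visit counts. This is a minimal sketch and not the authors' code: the function name `estimate_transitions`, the integer encoding of states and actions, and the choice to leave unvisited state-action pairs as NaN are all assumptions made for illustration.

```python
import numpy as np

def estimate_transitions(states, actions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    Hypothetical helper: `states` has one more entry than `actions`,
    and both are sequences of integer indices.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    # Each step t contributes one observed transition (s_t, a_t) -> s_{t+1}.
    for s, a, s_next in zip(states[:-1], actions, states[1:]):
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2, keepdims=True)
    # Assumption: rows for unvisited (s, a) pairs stay NaN instead of being guessed.
    p_hat = np.divide(counts, visits,
                      out=np.full_like(counts, np.nan),
                      where=visits > 0)
    return p_hat, visits.squeeze(-1)

# Toy trajectory with two states and two actions (only action 0 is logged).
toy_states = [0, 1, 0, 1, 0, 0]
toy_actions = [0, 0, 0, 0, 0]
p_hat, visits = estimate_transitions(toy_states, toy_actions, 2, 2)
```

The estimator makes no use of how the actions were chosen, so the same counting applies whether the logging policy is Markovian or history-dependent. The conditions under which the resulting estimate is asymptotically normal are the subject of the paper.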

Paper Structure

This paper contains 45 sections, 16 theorems, 144 equations.

Key Result

Proposition 1

For a controlled Markov chain that satisfies Assumption ass:return-time-growth,

Theorems & Definitions (28)

  • Definition 1
  • Remark 1
  • Definition 2
  • Example 1: Inhomogeneous Markov chain
  • Proposition 1
  • Remark 2
  • Example 2: Non-stationary Markov controls
  • Remark 3
  • Lemma 1
  • Definition 3: Ergodic Occupation Measure
  • ...and 18 more