Central Limit Theorems for Transition Probabilities of Controlled Markov Chains
Ziwei Su, Imon Banerjee, Diego Klabjan
TL;DR
This work develops the first asymptotic normality theory for non-parametric transition-matrix estimation in finite controlled Markov chains under history-dependent logging, enabling principled statistical inference for offline RL. The authors prove a properly scaled central limit theorem for the count-based estimator of transition probabilities under mild return-time growth and mixing assumptions, using a coupling-based proof technique that handles non-stationarity and non-Markovian controls. They further derive goodness-of-fit tests for transition kernels and extend the CLTs to downstream quantities in reinforcement learning, including the value, Q-, and advantage functions, as well as the value of the estimated optimal policy. This enables confidence intervals and hypothesis testing in offline policy evaluation and optimal policy recovery. The results characterize when a CLT holds and when none can exist, and they offer a unified framework for statistical inference in model-based offline RL across broad logging policies. Together, these contributions deliver a rigorous large-sample toolkit for evaluating and recovering policies from logged data in finite CMCs.
Abstract
We develop a central limit theorem (CLT) for the non-parametric estimator of the transition matrices in controlled Markov chains (CMCs) with finite state-action spaces. Our results establish precise conditions on the logging policy under which the estimator is asymptotically normal, and reveal settings in which no CLT can exist. We then build on this result to derive CLTs for the value, Q-, and advantage functions of any stationary stochastic policy, including the optimal policy recovered from the estimated model. Goodness-of-fit tests follow as a corollary and enable us to test whether the logged data is stochastic. These results provide new statistical tools for offline policy evaluation and optimal policy recovery, and enable hypothesis tests for transition probabilities.
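
To make the objects in the abstract concrete, the following is a minimal Python sketch of a count-based transition estimator and a plug-in value function for a stationary stochastic policy. It is an illustration under stated assumptions, not the authors' implementation: the function names `estimate_transitions` and `plug_in_value`, the single-trajectory input format, the known reward table, and the discount factor `gamma` are all hypothetical choices, and the estimator is assumed to take the usual form N(s, a, s') / N(s, a). The sketch does not reproduce the paper's CLT scaling or its asymptotic variance, so it yields point estimates only.

```python
import numpy as np


def estimate_transitions(states, actions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from one logged trajectory.

    Assumed input format (hypothetical): `states` has length T + 1 and
    `actions` has length T, both holding integer indices.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in zip(states[:-1], actions, states[1:]):
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2)  # N(s, a): visits to each state-action pair
    # Pairs that were never visited give 0/0 and are left as NaN, not guessed.
    with np.errstate(invalid="ignore", divide="ignore"):
        p_hat = counts / visits[:, :, None]
    return p_hat, visits


def plug_in_value(p_hat, rewards, policy, gamma):
    """Value of a stationary stochastic policy under the estimated model.

    `rewards` and `policy` have shape (S, A); each row of `policy` sums to 1.
    Solves the linear system (I - gamma * P_pi) V = r_pi.
    """
    p_pi = np.einsum("sa,sat->st", policy, p_hat)  # state-to-state kernel
    r_pi = (policy * rewards).sum(axis=1)          # expected one-step reward
    n = p_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * p_pi, r_pi)


# Toy logged trajectory with 2 states and 2 actions (made-up data).
states = [0, 1, 0, 1, 1, 0]
actions = [0, 0, 1, 0, 1]
p_hat, visits = estimate_transitions(states, actions, n_states=2, n_actions=2)

uniform_policy = np.full((2, 2), 0.5)
rewards = np.ones((2, 2))
v_hat = plug_in_value(p_hat, rewards, uniform_policy, gamma=0.9)
```

In this toy run every state-action pair is visited at least once, so `p_hat` contains no NaN entries. With logged data that leaves some pairs unvisited, `plug_in_value` would need an explicit rule for those entries, which this sketch deliberately does not choose.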
