Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

Imon Banerjee; Harsha Honnappa; Vinayak Rao

TL;DR

This paper addresses offline estimation of transition matrices in finite-state controlled Markov chains under a fixed logging policy, using a simple nonparametric estimator based on visitation counts. It provides PAC-minimax guarantees that tie estimation error to the logging policy’s mixing properties, and shows that the estimator is minimax-optimal under geometric mixing, while weaker mixing requires larger samples. The results extend to offline policy evaluation and policy optimization, and apply across stationary, Markov, episodic, and non-stationary controls, including inventory management examples. The work offers practical guidelines on data collection (mixing, return times) and yields a general framework for sample-efficient offline identification of controlled stochastic systems, with potential extensions to continuous-state spaces and adversarial settings.

Abstract

In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
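
As a rough illustration of the count-based estimator described above, the following is a minimal sketch, assuming the estimate of each transition probability is the empirical frequency of observed transitions out of a state-action pair. The function name `estimate_transitions`, the array layout, and the convention of leaving unvisited pairs as all-zero rows are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def estimate_transitions(states, actions, n_states, n_actions):
    """Empirical-frequency estimate of P(s' | s, a) from one logged trajectory.

    states  : sequence of visited states s_0, ..., s_n (integers in [0, n_states))
    actions : sequence of logged actions a_0, ..., a_{n-1}
    Returns (p_hat, visits), where p_hat[s, a, s'] is the estimated
    transition probability and visits[s, a] is the visitation count.
    """
    states = np.asarray(states)
    actions = np.asarray(actions)
    n_steps = len(states) - 1

    counts = np.zeros((n_states, n_actions, n_states))
    # Each step t contributes one observed transition (s_t, a_t) -> s_{t+1}.
    np.add.at(counts, (states[:-1], actions[:n_steps], states[1:]), 1.0)

    visits = counts.sum(axis=2, keepdims=True)
    # Assumed convention: pairs never visited by the logging policy
    # keep an all-zero row instead of an arbitrary default distribution.
    p_hat = np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)
    return p_hat, visits[..., 0]


# Toy trajectory with 2 states and 2 actions (hypothetical data).
p_hat, visits = estimate_transitions([0, 1, 0, 1, 1, 0], [0, 1, 0, 0, 1], 2, 2)
```

The returned `visits` array makes the dependence on the logging policy visible: state-action pairs that the policy rarely reaches have small counts, and their rows of `p_hat` are correspondingly unreliable.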

Paper Structure (67 sections, 36 theorems, 332 equations, 1 table)

Introduction
Non-parametric estimation:
System Identification:
Model-Based Offline Reinforcement Learning:
Preliminaries
Definitions.
Mixing Coefficients
Empirical Estimation of Transition Probability Matrices
Sketch of Proof of Theorem \ref{thm:sample-complexity}
Minimax Sample Complexity
Sketch of Proof of Theorem \ref{thm:minimax}
CASE I:
CASE II:
Applications
Reduction of Assumptions
...and 52 more sections

Key Result

Lemma 2.1

The uniform and weak mixing coefficients in equations \eqref{def:weak-mixing} and \eqref{def:uniform-mixing} satisfy $\phi_{i,j} \leq \bar{\eta}_{i,j} \leq 2\phi_{i,j}$.

Theorems & Definitions (90)

Definition 2.1
Remark 2.1
Definition 2.2
Lemma 2.1
Remark 2.2
Remark 2.3
Lemma 2.2
Proposition 3.1
Remark 3.1
Lemma 3.1
...and 80 more
