Concentration of Cumulative Reward in Markov Decision Processes

Borna Sayedana, Peter E. Caines, Aditya Mahajan

TL;DR

The paper develops a unified martingale-based framework to characterize both asymptotic and non-asymptotic concentration of cumulative rewards in MDPs across average, discounted, and finite-horizon settings. For average-reward policies it establishes the law of large numbers (LLN), the central limit theorem (CLT), and the law of the iterated logarithm (LIL), alongside Azuma-type and non-asymptotic LIL bounds, and it extends these results to the discounted and finite-horizon frameworks with corresponding non-asymptotic guarantees. A key insight is that the sample-path difference between policies can be tightly controlled, which yields a rate-equivalence result between two common notions of regret in RL. The approach relies on a martingale decomposition tied to the policy evaluation equation, with extensions to stochastic rewards and a vanishing-discount analysis linking the discounted and average-reward bounds. These results provide risk-aware performance guarantees for policy evaluation and learning in high-stakes domains.

Abstract

In this paper, we investigate the concentration properties of cumulative reward in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both the infinite-horizon settings (i.e., the average and discounted reward frameworks) and the finite-horizon setting. Our asymptotic results include the law of large numbers, the central limit theorem, and the law of the iterated logarithm, while our non-asymptotic bounds include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of the iterated logarithm. Additionally, we explore two key implications of our results. First, we analyze the sample path behavior of the difference in rewards between any two stationary policies. Second, we show that two alternative definitions of regret for learning policies proposed in the literature are rate-equivalent. Our proof techniques rely on a martingale decomposition of cumulative reward, properties of the solution to the policy evaluation fixed-point equation, and both asymptotic and non-asymptotic concentration results for martingale difference sequences.
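The martingale decomposition mentioned in the abstract can be illustrated numerically. The sketch below is not taken from the paper: it assumes a fixed stationary policy, so that the MDP reduces to a Markov reward process with transition matrix `P` and reward vector `r`, and it uses the standard average-reward policy evaluation (Poisson) equation $\lambda + V(s) = r(s) + \sum_{s'} P(s,s') V(s')$. The function names `solve_poisson` and `decomposition_gap`, the two-state example, and the normalization `V[0] = 0` are all illustrative assumptions. Under these assumptions, the cumulative reward minus $T\lambda$ should equal a sum of martingale differences plus the boundary term $V(S_0) - V(S_T)$.

```python
import numpy as np

def solve_poisson(P, r):
    """Solve lam + V = r + P @ V, pinning V[0] = 0 (assumes P is unichain)."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    A = np.eye(len(r)) - P
    A[:, 0] = 1.0          # column 0 now multiplies lam, since V[0] is fixed to 0
    x = np.linalg.solve(A, r)
    lam = x[0]
    V = x.copy()
    V[0] = 0.0
    return lam, V

def decomposition_gap(P, r, T, seed=0):
    """Simulate T steps and return both sides of the decomposition
    sum_t r(S_t) - T*lam  =  sum_t M_{t+1} + V(S_0) - V(S_T),
    where M_{t+1} = V(S_{t+1}) - (P V)(S_t) is a martingale difference."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    rng = np.random.default_rng(seed)
    lam, V = solve_poisson(P, r)
    PV = P @ V
    s0 = s = 0
    total_reward = 0.0
    martingale_sum = 0.0
    for _ in range(T):
        s_next = rng.choice(len(r), p=P[s])
        total_reward += r[s]
        martingale_sum += V[s_next] - PV[s]
        s = s_next
    lhs = total_reward - T * lam
    rhs = martingale_sum + V[s0] - V[s]
    return lhs, rhs

# Hypothetical two-state chain used only for illustration.
P_example = np.array([[0.9, 0.1], [0.5, 0.5]])
r_example = np.array([1.0, 0.0])
lam_example, V_example = solve_poisson(P_example, r_example)
lhs_example, rhs_example = decomposition_gap(P_example, r_example, T=1000)
print(lam_example, lhs_example, rhs_example)
```

In this sketch the two returned quantities agree up to floating-point error on every sample path, since the identity is algebraic rather than probabilistic. Because the boundary term is bounded, concentration of the cumulative reward around $T\lambda$ then reduces to concentration of the martingale sum, which is where Azuma-Hoeffding-type arguments would apply.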

Paper Structure

This paper contains 94 sections, 50 theorems, and 239 equations.

Key Result

Proposition 5

Suppose model $\mathcal{M} = (P,r)$ is AROE solvable with a solution $(\lambda^{*},V^{*})$. Then:

Theorems & Definitions (79)

  • Definition 1
  • Definition 2: AROE Solvability
  • Definition 3
  • Definition 4
  • Proposition 5: bertsekas2012dynamic
  • Proposition 6: bertsekas2012dynamic
  • Definition 7: kallenberg2002classification
  • Proposition 8: puterman2014markov
  • Proposition 9: puterman2014markov
  • Remark 10
  • ...and 69 more