Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning

Yang Xu, Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

The paper provides the first finite-sample guarantees for policy evaluation in robust average-reward MDPs by proving a contraction of the robust Bellman operator under a specially crafted semi-norm and coupling this with a biased stochastic approximation framework. It introduces a truncated MLMC estimator to compute worst-case effects under TV and Wasserstein uncertainty sets with finite expected samples, achieving $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity for both value and average-reward estimation. A robust TD learning algorithm is developed to iteratively update the robust value function and robust average reward, with explicit bias-aware analysis ensuring finite-time convergence. The work emphasizes ergodicity of the nominal model and provides a foundation for robust, sample-efficient long-horizon RL, with clear extensions to more general uncertainty sets and function-approximation contexts.

Abstract

We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior work in this setting has established only asymptotic convergence guarantees, leaving open the question of sample complexity. In this work, we address this gap by showing that the robust Bellman operator is a contraction under a carefully constructed semi-norm, and developing a stochastic approximation framework with controlled bias. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution, ensuring a finite expected sample complexity while maintaining a small bias that decays exponentially with the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average reward estimation, marking a significant advancement in robust reinforcement learning theory.
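The truncation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a total-variation uncertainty set of radius `delta` around the empirical distribution, a geometric level distribution with parameter 1/2, and hypothetical names (`tv_support`, `truncated_mlmc`, `sample_next_state`, `n_max`) chosen only for illustration.

```python
import numpy as np


def tv_support(p, V, delta):
    """min over distributions q with 0.5 * ||q - p||_1 <= delta of q @ V.

    Greedy solution: move up to `delta` probability mass from the
    highest-value states onto the lowest-value state.
    """
    q = np.asarray(p, dtype=float).copy()
    V = np.asarray(V, dtype=float)
    lo = int(np.argmin(V))
    budget = float(delta)
    for s in np.argsort(-V):  # visit states from highest value to lowest
        if s == lo or budget <= 0:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lo] += moved
        budget -= moved
    return float(q @ V)


def truncated_mlmc(sample_next_state, V, delta, n_max, rng):
    """One truncated-MLMC estimate of the support function (illustrative)."""
    n_states = len(V)
    # Level N with P(N = n) = 2^-(n+1), n = 0, 1, ..., truncated at n_max.
    level = min(int(rng.geometric(0.5)) - 1, n_max)
    # Probability of the truncated level; the tail mass collapses onto n_max.
    prob = 2.0 ** -(level + 1) if level < n_max else 2.0 ** -n_max
    samples = np.array([sample_next_state(rng) for _ in range(2 ** (level + 1))])

    def sigma(x):
        # Support function evaluated at the empirical distribution of x.
        empirical = np.bincount(x, minlength=n_states) / len(x)
        return tv_support(empirical, V, delta)

    # Multilevel correction: all samples versus the even/odd halves.
    correction = sigma(samples) - 0.5 * (sigma(samples[0::2]) + sigma(samples[1::2]))
    return sigma(samples[:1]) + correction / prob
```

Under these assumptions each call draws $2^{N'+1}$ samples with $N' \le$ `n_max`, so the expected number of samples per call is `n_max + 2`, whereas without truncation the expectation diverges. A call might look like `truncated_mlmc(lambda rng: int(rng.choice(3, p=[0.2, 0.5, 0.3])), [0.0, 1.0, 2.0], 0.1, 8, np.random.default_rng(0))`.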

Paper Structure

This paper contains 44 sections, 27 theorems, 223 equations, 4 tables, 1 algorithm.

Key Result

Theorem 3.2

If $(g,V)$ is a solution to the robust Bellman equation where $\sigma_{\mathcal{P}^a_s}(V) = \min_{p\in\mathcal{P}^a_s} p^\top V$ is denoted as the support function, then the scalar $g$ corresponds to the robust average reward, i.e., $g = g^\pi_\mathcal{P}$, and the worst-case transition kernel $\mathsf P_V$ belongs to the set of minimizing transition ke

Theorems & Definitions (41)

  • Theorem 3.2: Robust Bellman Equation, Theorem 3.1 in wang2023model
  • Definition 3.3: Robust Bellman Operator, wang2023model
  • Lemma 4.1
  • Theorem 4.2
  • Theorem 5.1: Finite Sample Complexity
  • Theorem 5.2: Exponentially Decaying Bias
  • Lemma 5.3
  • Theorem 5.4: Linear Variance
  • Theorem 6.1
  • Theorem 6.2
  • ...and 31 more