Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning

Yang Xu; Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

The paper provides the first finite-sample guarantees for policy evaluation in robust average-reward MDPs by proving a contraction of the robust Bellman operator under a specially crafted semi-norm and coupling this with a biased stochastic approximation framework. It introduces a truncated MLMC estimator to compute worst-case effects under TV and Wasserstein uncertainty sets with finite expected samples, achieving $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity for both value and average-reward estimation. A robust TD learning algorithm is developed to iteratively update the robust value function and robust average reward, with explicit bias-aware analysis ensuring finite-time convergence. The work emphasizes ergodicity of the nominal model and provides a foundation for robust, sample-efficient long-horizon RL, with clear extensions to more general uncertainty sets and function-approximation contexts.

Abstract

We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior work in this setting has established only asymptotic convergence guarantees, leaving open the question of sample complexity. In this work, we address this gap by showing that the robust Bellman operator is a contraction under a carefully constructed semi-norm, and by developing a stochastic approximation framework with controlled bias. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution, ensuring a finite expected sample complexity while maintaining a small bias that decays exponentially with the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average reward estimation, marking a significant advancement in robust reinforcement learning theory.
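To make the truncation idea concrete, the following is a minimal sketch of a generic truncated MLMC estimator, not the paper's exact construction. It assumes scalar i.i.d. samples and an arbitrary nonlinear functional `f` of the sample mean, standing in for the worst-case operator that the paper evaluates on empirical transition distributions. All names (`truncated_geometric`, `level_probability`, `truncated_mlmc`, `expected_samples`, `n_max`) are hypothetical and chosen for illustration only.

```python
import random


def truncated_geometric(n_max, rng):
    """Draw N ~ Geometric(1/2) on {0, 1, ...} and cap it at n_max."""
    n = 0
    while n < n_max and rng.random() < 0.5:
        n += 1
    return n


def level_probability(n, n_max):
    """P(N' = n) for the capped level N' = min(N, n_max)."""
    return 2.0 ** -(n + 1) if n < n_max else 2.0 ** -n_max


def expected_samples(n_max):
    """Expected number of samples per call: sum over levels of p(n) * 2^(n+1)."""
    return sum(level_probability(n, n_max) * 2 ** (n + 1)
               for n in range(n_max + 1))


def truncated_mlmc(sample, f, n_max, rng):
    """One truncated-MLMC estimate of f(E[X]) (illustrative assumption).

    sample: callable rng -> float, draws one i.i.d. sample
    f:      nonlinear functional applied to a sample mean
    Returns (estimate, number_of_samples_used).
    """
    n = truncated_geometric(n_max, rng)
    xs = [sample(rng) for _ in range(2 ** (n + 1))]

    def mean(values):
        return sum(values) / len(values)

    # Multi-level correction: full-batch value minus the average of the
    # values on the even-indexed and odd-indexed halves.
    delta = f(mean(xs)) - 0.5 * (f(mean(xs[0::2])) + f(mean(xs[1::2])))

    # Importance-weight the correction by the probability of the drawn level.
    estimate = f(xs[0]) + delta / level_probability(n, n_max)
    return estimate, len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    est, used = truncated_mlmc(lambda r: r.gauss(1.0, 1.0),
                               lambda m: m * m, 5, rng)
    print(est, used, expected_samples(5))
```

In this sketch the level-wise corrections telescope, so the estimator has the same expectation as applying `f` to the mean of `2 ** (n_max + 1)` samples, while the expected number of samples per call is only `n_max + 2`. Without the cap, each level would contribute one expected sample and the sum would diverge, which is the infinite expected sample complexity that the truncation is meant to avoid.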
