Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

Yang Xu, Swetha Ganesh, Vaneet Aggarwal

TL;DR

This work develops non-asymptotic, model-free methods for distributionally robust average-reward RL under contamination, TV, and Wasserstein uncertainty sets. It introduces a uniform one-step contraction of the robust Bellman operator using a semi-norm, enabling $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for robust $Q$-learning and a robust actor-critic with policy improvements supported by uniform critic bounds. The methods rely on simulation-based estimators for the robust Bellman updates and Fréchet subgradients for policy optimization, achieving end-to-end finite-sample guarantees across all considered uncertainty sets. Numerical experiments on ride-hailing and a three-state loop illustrate the practical robustness and convergence properties of the proposed algorithms. Together, the results establish model-free, robust planning for long-horizon decision problems with provable sample-efficiency guarantees under transition uncertainty.

Abstract

We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
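To make "a semi-norm with constant functions quotiented out" concrete, here is a minimal sketch using the span semi-norm, a common example of such a semi-norm in average-reward analysis. This is an illustrative assumption: the abstract does not specify the semi-norm the paper constructs, and it may differ from the span. The function name `span_seminorm` is hypothetical.

```python
import numpy as np

def span_seminorm(x):
    """Span semi-norm: max(x) - min(x).

    Illustrative stand-in only; the paper's own semi-norm is not
    specified in the abstract and may be defined differently.
    """
    x = np.asarray(x, dtype=float)
    return float(np.max(x) - np.min(x))

# Adding a constant vector does not change the semi-norm, so two
# Q-functions that differ by a constant are treated as identical.
q = np.array([1.0, 3.0, 2.0])
shifted = q + 5.0
same = span_seminorm(q) == span_seminorm(shifted)

# Constant functions have semi-norm zero (hence "semi"-norm, not norm).
constant_is_zero = span_seminorm(np.full(4, 7.0)) == 0.0
```

Working modulo constants is natural in the average-reward setting, where relative value functions are only determined up to an additive constant.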

Paper Structure

This paper contains 34 sections, 26 theorems, 207 equations, 2 figures, 5 algorithms.

Key Result

Theorem 3.1

For a fixed policy $\pi$ and for each $s \in \mathcal{S}$, define the Robust Bellman operator with scalar $g$ as follows where $\sigma_{\mathcal{P}^a_s}(V) \coloneqq \min_{p\in\mathcal{P}^a_s} p^\top V$. If $(g,V)$ is a solution to the robust Bellman equation: then the scalar $g$ corresponds to the robust average reward, i.e., $g = g^\pi_\mathcal{P}$, and the worst-case transition kernel $\maths
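As a rough illustration of the quantities in Theorem 3.1, the sketch below evaluates the support function $\sigma_{\mathcal{P}^a_s}(V) = \min_{p\in\mathcal{P}^a_s} p^\top V$ for a contamination uncertainty set and applies a robust Bellman operator with scalar $g$. Two things are assumptions not stated in the excerpt above: the contamination set is taken to be $\{(1-\delta)\,p_0 + \delta\,q\}$ around a nominal kernel $p_0$ with radius $\delta$, and the operator is taken in the usual average-reward form $\sum_a \pi(a|s)\,[r(s,a) - g + \sigma_{\mathcal{P}^a_s}(V)]$. All function and parameter names are hypothetical.

```python
import numpy as np

def sigma_contamination(p_nominal, V, delta):
    """min_{p in P} p^T V for the assumed contamination set
    P = {(1 - delta) * p_nominal + delta * q : q any distribution}.
    The minimizing q puts all of its mass on argmin V."""
    return (1.0 - delta) * float(p_nominal @ V) + delta * float(np.min(V))

def robust_bellman_operator(V, g, policy, rewards, P_nominal, delta):
    """Assumed form: T_g(V)(s) = sum_a pi(a|s) [r(s,a) - g + sigma(V)].

    policy:    (S, A) array of action probabilities pi(a|s)
    rewards:   (S, A) array of rewards r(s, a)
    P_nominal: (S, A, S) array of nominal transition probabilities
    """
    n_states, n_actions = rewards.shape
    out = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            sigma = sigma_contamination(P_nominal[s, a], V, delta)
            out[s] += policy[s, a] * (rewards[s, a] - g + sigma)
    return out

# Toy check: two states, one action, reward 1 everywhere.
# (g, V) = (1, 0) is then a fixed point, so g matches the average reward.
policy = np.ones((2, 1))
rewards = np.ones((2, 1))
P_nominal = np.array([[[0.5, 0.5]], [[0.5, 0.5]]])
V0 = np.zeros(2)
fixed_point = robust_bellman_operator(V0, 1.0, policy, rewards, P_nominal, 0.2)
```

In the toy check, `fixed_point` equals `V0`, consistent with the theorem's statement that the scalar $g$ in a solution $(g, V)$ is the robust average reward. The TV and Wasserstein sets require a different support-function computation and are not covered by this sketch.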

Figures (2)

  • Figure 1: Ride-hailing via robust $Q$-learning
  • Figure 2: Control loop via robust actor-critic

Theorems & Definitions (47)

  • Theorem 3.1: Robust Bellman Equation, Theorem 3.1 in wang2023model
  • Lemma 4.1: Lemma 4.1 in wang2023model
  • Theorem 4.3
  • proof : Proof sketch
  • Theorem 4.4
  • proof : Proof sketch
  • Proposition 5.1
  • Theorem 5.2
  • Remark 5.3
  • Definition 5.5
  • ...and 37 more