Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Yang Xu, Swetha Ganesh, Vaneet Aggarwal
TL;DR
This work develops non-asymptotic, model-free methods for distributionally robust average-reward RL under contamination, total-variation (TV), and Wasserstein uncertainty sets. It shows that the robust Bellman operator is a uniform one-step contraction in a suitably designed semi-norm, which yields $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for robust $Q$-learning and for a robust actor-critic whose policy improvement steps are supported by uniform critic bounds. The methods rely on simulation-based estimators of the robust Bellman updates and on Fréchet subgradients for policy optimization, giving end-to-end finite-sample guarantees for all three uncertainty sets. Numerical experiments on a ride-hailing problem and a three-state loop illustrate the robustness and convergence of the proposed algorithms. Together, the results give model-free robust methods for long-horizon decision problems with provable sample-efficiency guarantees under transition uncertainty.
Abstract
We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
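
To make the contraction idea concrete, the following is a minimal sketch for the contamination set only; it is not the algorithm or the semi-norm analyzed in this work. It assumes a contamination radius `delta`, a known nominal kernel `P` used purely as a simulator, the plain span semi-norm (max minus min) in place of the paper's semi-norm, a relative-value-style update with the hypothetical reference entry `Q[0, 0]`, and a hypothetical step-size schedule. Under these assumptions the worst-case expectation has the closed form $(1-\delta)\,\mathbb{E}_p[V] + \delta \min_s V(s)$, and since the $\delta \min_s V(s)$ term is constant across state-action pairs, the operator contracts the span by a factor of at most $1-\delta$.

```python
import numpy as np


def span(x):
    """Span semi-norm: max minus min, so every constant function has span 0."""
    return float(np.max(x) - np.min(x))


def robust_bellman(Q, r, P, delta):
    """Exact robust operator for a delta-contamination set (assumed form).

    The worst case over {(1 - delta) * p + delta * q : q any distribution}
    equals (1 - delta) * E_p[V] + delta * min_s V(s), with V(s) = max_a Q(s, a).
    """
    V = Q.max(axis=1)                                # shape (S,)
    return r + (1.0 - delta) * (P @ V) + delta * V.min()


def robust_relative_q_learning(r, P, delta, steps, rng):
    """Synchronous sample-based update; Q[0, 0] is the reference entry."""
    S, A = r.shape
    cdf = P.cumsum(axis=2)                           # nominal kernel CDFs
    Q = np.zeros((S, A))
    for t in range(1, steps + 1):
        alpha = 1.0 / t ** 0.75                      # hypothetical step-size schedule
        V = Q.max(axis=1)
        # One next-state draw per (s, a) from the nominal kernel (inverse CDF).
        u = rng.random((S, A, 1))
        nxt = np.minimum((u > cdf).sum(axis=2), S - 1)
        # One-sample estimate of the contamination worst-case expectation,
        # shifted by the reference entry to pin down the additive constant.
        target = r + (1.0 - delta) * V[nxt] + delta * V.min() - Q[0, 0]
        Q += alpha * (target - Q)
    return Q


# Small random MDP (all values here are illustrative, not from the paper).
rng = np.random.default_rng(0)
S, A, delta = 4, 2, 0.2
r = rng.random((S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))           # shape (S, A, S)

# Empirical check of the span contraction on one random pair of Q-tables.
Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
lhs = span(robust_bellman(Q1, r, P, delta) - robust_bellman(Q2, r, P, delta))
print("contraction holds:", lhs <= (1.0 - delta) * span(Q1 - Q2) + 1e-12)

Q = robust_relative_q_learning(r, P, delta, steps=5000, rng=rng)
print("reference entry Q[0, 0]:", Q[0, 0], "greedy policy:", Q.argmax(axis=1))
```

In this sketch the reference entry `Q[0, 0]` plays the role of the robust average reward at the fixed point of the shifted update, which is one common way to remove the constant-function ambiguity that the semi-norm quotients out. The TV and Wasserstein sets have no such closed form and would need a different estimator.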
