Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation
Saptarshi Mandal, Yashaswini Murthy, R. Srikant
TL;DR
This work establishes finite-time learning guarantees for distributionally robust reinforcement learning (DRRL) with linear function approximation, focusing on total-variation (TV) and Wasserstein uncertainty sets around a nominal model. It introduces a model-free, non-generative two-time-scale algorithm that combines an outer target-network update with an inner-loop dual optimization to handle robustness, achieving an $\tilde{O}(1/\epsilon^2)$ sample complexity up to function-approximation error. The analysis handles non-contractive projected robust Bellman operators via dual-value function approximation and leverages mixing properties to bound noise, bias, and projection mismatch. The results bridge theory and practice by providing non-asymptotic guarantees for robust TD learning, with an extension to robust Q-learning, under broad uncertainty classes, which matters for robust RL in large state spaces. Extensions to additional uncertainty metrics and a discussion of the computational trade-offs of Wasserstein updates point to future directions toward scalable, model-free DRRL.
Abstract
Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. In particular, we are interested in maximizing the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for robust temporal-difference (TD) learning for policy evaluation are limited to tabular MDPs or depend on restrictive discount-factor assumptions when function approximation is used. We present the first robust TD learning algorithm with linear function approximation, where robustness is measured with respect to total-variation distance and Wasserstein-l distance uncertainty sets. Additionally, our algorithm is model-free and does not require generative access to the MDP. Our algorithm combines a two-time-scale stochastic-approximation update with an outer-loop target-network update. We establish an $\tilde{O}(1/\epsilon^2)$ sample complexity to obtain an $\epsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust Q-learning with function approximation.
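To make the notion of a worst-case value under a total-variation uncertainty set concrete, the following is a minimal tabular sketch. It is not the paper's algorithm: it assumes the nominal next-state distribution is known (so it is model-based), it works on a small finite state space, and it takes the uncertainty set to be all distributions within TV distance `delta` of the nominal one. The function names `worst_case_expectation_tv` and `robust_td_target` are hypothetical and introduced only for illustration.

```python
def worst_case_expectation_tv(p, v, delta):
    """Return (value, q) minimizing sum(q[i] * v[i]) over distributions q
    with total-variation distance 0.5 * sum(|q[i] - p[i]|) <= delta.

    Assumption: the adversary may place mass on any state, so the minimizer
    moves up to `delta` probability mass from the highest-value states to
    the single lowest-value state.
    """
    q = list(p)
    n = len(v)
    lowest = min(range(n), key=lambda i: v[i])  # state that receives the mass
    budget = delta                              # mass still allowed to move
    # Visit states from highest value to lowest value.
    for i in sorted(range(n), key=lambda i: v[i], reverse=True):
        if budget <= 0.0:
            break
        if i == lowest:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[lowest] += moved
        budget -= moved
    value = sum(qi * vi for qi, vi in zip(q, v))
    return value, q


def robust_td_target(reward, gamma, p_next, v, delta):
    """Robust one-step target: reward + gamma * worst-case next-state value."""
    worst, _ = worst_case_expectation_tv(p_next, v, delta)
    return reward + gamma * worst


# Illustrative numbers only (not taken from the paper).
p_nominal = [0.5, 0.3, 0.2]
values = [1.0, 2.0, 4.0]
worst, q_adv = worst_case_expectation_tv(p_nominal, values, delta=0.25)
target = robust_td_target(1.0, 0.9, p_nominal, values, delta=0.25)
```

With these illustrative numbers the nominal expectation is 1.9, while the worst case over the TV ball of radius 0.25 is about 1.25, since the adversary shifts mass from the two higher-value states onto the lowest-value state. A model-free method such as the one described in the abstract cannot evaluate this minimization directly from the nominal distribution, which is one motivation for working with a dual formulation instead.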
