Finite-Time Bounds for Distributionally Robust TD Learning with Linear Function Approximation

Saptarshi Mandal, Yashaswini Murthy, R. Srikant

TL;DR

This work addresses finite-time learning guarantees for distributionally robust reinforcement learning (DRRL) with linear function approximation, focusing on total-variation (TV) and Wasserstein uncertainty sets around a nominal model. It introduces a model-free, non-generative two-time-scale algorithm with an outer target-network update and inner-loop dual optimization to handle robustness, achieving an $\tilde{O}(1/\epsilon^2)$ sample complexity up to function-approximation error. The analysis handles non-contractive projected robust Bellman operators through function approximation of the dual variables and leverages mixing properties to bound noise, bias, and projection mismatch. The results bridge theory and practice by providing non-asymptotic guarantees for robust TD (extendable to robust Q-learning) under broad uncertainty classes, with practical implications for robust RL in large state spaces. Extensions to additional uncertainty metrics and a discussion of computational trade-offs for Wasserstein updates point to future directions toward scalable, model-free DRRL.

Abstract

Distributionally robust reinforcement learning (DRRL) focuses on designing policies that achieve good performance under model uncertainties. In particular, we are interested in maximizing the worst-case long-term discounted reward, where the data for RL comes from a nominal model while the deployed environment can deviate from the nominal model within a prescribed uncertainty set. Existing convergence guarantees for robust temporal-difference (TD) learning for policy evaluation are limited to tabular MDPs or depend on restrictive discount-factor assumptions when function approximation is used. We present the first robust TD learning algorithm with linear function approximation, where robustness is measured with respect to total-variation distance and Wasserstein-l distance uncertainty sets. Additionally, our algorithm is model-free and does not require generative access to the MDP. Our algorithm combines a two-time-scale stochastic-approximation update with an outer-loop target-network update. We establish an $\tilde{O}(1/\epsilon^2)$ sample complexity to obtain an $\epsilon$-accurate value estimate. Our results close a key gap between the empirical success of robust RL algorithms and the non-asymptotic guarantees enjoyed by their non-robust counterparts. The key ideas in the paper also extend in a relatively straightforward fashion to robust Q-learning with function approximation.
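
To make the notion of a worst case over a total-variation uncertainty set concrete, here is a minimal sketch of the inner minimization for a single known next-state distribution. It is a model-based illustration only, not the paper's model-free algorithm, which works from samples and a dual formulation. The function name `worst_case_expectation_tv`, its arguments, and the convention that TV distance is half the L1 distance are assumptions made for this example.

```python
def worst_case_expectation_tv(p, v, delta):
    """Minimize sum_i q[i] * v[i] over distributions q with TV(q, p) <= delta.

    Assumes TV(q, p) = 0.5 * sum_i |q[i] - p[i]|. The minimizer moves up to
    `delta` probability mass from the highest-value states onto the
    lowest-value state. Returns (worst-case expectation, minimizing q).
    """
    if len(p) != len(v) or not p:
        raise ValueError("p and v must be non-empty and of equal length")
    if abs(sum(p) - 1.0) > 1e-9 or min(p) < 0:
        raise ValueError("p must be a probability vector")
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")

    q = list(p)
    lowest = min(range(len(v)), key=lambda i: v[i])  # state that receives mass
    budget = delta                                   # mass still allowed to move

    # Drain mass from states in decreasing order of value.
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if budget <= 0:
            break
        if i == lowest:
            continue
        moved = min(q[i], budget)
        q[i] -= moved
        q[lowest] += moved
        budget -= moved

    return sum(qi * vi for qi, vi in zip(q, v)), q


if __name__ == "__main__":
    nominal = [0.5, 0.3, 0.2]   # hypothetical nominal next-state distribution
    values = [1.0, 2.0, 3.0]    # hypothetical value estimates
    worst, q_star = worst_case_expectation_tv(nominal, values, delta=0.1)
    print(worst, q_star)
```

With these hypothetical inputs the nominal expectation is 1.7, and shifting 0.1 of mass from the state with value 3 to the state with value 1 lowers it to about 1.5. A robust Bellman backup replaces the nominal expectation of the next-state value with this kind of worst-case quantity.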

Paper Structure

This paper contains 51 sections, 9 theorems, 209 equations, 2 algorithms.

Key Result

Theorem 1

Define [definition not recovered]. Assume that the general-uncertainty-set assumption and the ergodicity assumption hold, and run $K \ge k_{\mathrm{mix}}$ inner iterations per outer iteration. Then, for any $T \ge 1$, we have [bound not recovered], where the residual obeys the schedule-dependent rates [rates not recovered], and $\mathcal{O}$ hides problem-dependent constants depending on $(\gamma,\mu,\sigma_{B_\nu},g_M,C_{\mathrm{mix}},\rho,B_\nu,\beta_0,c)$.

Theorems & Definitions (13)

  • Theorem 1: Finite-time bound: rates and dependencies (informal)
  • Corollary 1: Sample Complexity
  • Lemma 1
  • Lemma 2
  • Proof 1
  • Lemma 3
  • Lemma 4: Cross with Markov mismatch
  • Proof 2
  • Lemma 5: Bias second order at $1/k$
  • Proof 3
  • ...and 3 more