Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation

Debamita Ghosh, George K. Atia, Yue Wang

TL;DR

The paper tackles the challenge of deploying RL under environmental shifts by formulating online distributionally robust RL with TV-divergence uncertainty and general function approximation. It introduces Robust Fitted Learning with TV-Divergence Uncertainty Set (RFL-TV), a dual-robust operator-based method that uses global confidence sets and a dual network to drive exploration, achieving a near-optimal sublinear regret bound that scales to large state-action spaces via a robust coverability metric. Theoretical contributions include a regret bound and sample complexity that depend on the robust coverability constant and are near-optimal in the linear TV-RMDP setting, along with a novel dual-optimization framework for robust Bellman equations. Empirical results on CartPole demonstrate strong robustness to action and dynamics perturbations, validate the computational viability of online DR-RL with function approximation, and show favorable comparisons to tabular and non-robust baselines across a spectrum of perturbations.

Abstract

The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge, such as access to a generative model or a large offline dataset, and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
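To make the "worst-case performance over an uncertainty set" concrete, the sketch below evaluates the robust expectation $\inf_{P:\,\mathrm{TV}(P,P^{0})\le\sigma}\mathbb{E}_{P}[V]$ on a small finite state space in two ways: a direct greedy primal computation, and the standard TV dual $\max_{\alpha}\{\mathbb{E}_{P^{0}}[\min(V,\alpha)]-\sigma(\alpha-\min_{s}\min(V(s),\alpha))\}$. This is an illustration under stated assumptions, not the RFL-TV algorithm: it assumes $\mathrm{TV}=\tfrac{1}{2}\lVert P-P^{0}\rVert_{1}$, an uncertainty set not restricted to the support of $P^{0}$, and a tabular value vector. The function names and the example numbers are hypothetical.

```python
import numpy as np


def tv_worst_case_primal(p0, v, sigma):
    """Worst-case expectation of v over {P : TV(P, p0) <= sigma} (assumed TV = 0.5 * L1).

    Greedy argument: the adversary moves up to `sigma` probability mass
    from the highest-value states onto a single lowest-value state.
    """
    p = np.asarray(p0, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    lo = int(np.argmin(v))          # state that receives the shifted mass
    budget = float(sigma)
    for s in np.argsort(-v):        # visit states from highest to lowest value
        if s == lo or budget <= 0:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[lo] += moved
        budget -= moved
    return float(p @ v)


def tv_worst_case_dual(p0, v, sigma):
    """Same quantity via the scalar dual, maximised over alpha.

    The dual objective is piecewise linear and concave in alpha with
    breakpoints at the values of v, so checking those values suffices.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    best = -np.inf
    for alpha in np.unique(v):
        clipped = np.minimum(v, alpha)
        value = p0 @ clipped - sigma * (alpha - clipped.min())
        best = max(best, float(value))
    return best


if __name__ == "__main__":
    p0 = [0.5, 0.3, 0.2]   # hypothetical nominal next-state distribution
    v = [1.0, 2.0, 4.0]    # hypothetical next-state values
    for sigma in (0.0, 0.1, 0.2, 0.4):
        primal = tv_worst_case_primal(p0, v, sigma)
        dual = tv_worst_case_dual(p0, v, sigma)
        print(f"sigma={sigma:.1f}  primal={primal:.4f}  dual={dual:.4f}")
```

The two routines should agree for every $\sigma\in[0,1]$. The dual form is the one that extends beyond tabular problems, since it replaces an optimization over distributions with an optimization over a scalar (or, with function approximation, a learned dual function) that needs only samples from the nominal dynamics.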

Paper Structure

This paper contains 39 sections, 17 theorems, 83 equations, 3 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Let $\pi$ be any policy, and let $\mu_h^{\pi}$ denote the visitation measure on $\mathcal{S}\times\mathcal{A}$ at step $h$ induced by $\pi$ under $P^{\star}$. Suppose $\mathcal{D}$ is a dataset collected by running $\pi$. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$,

Figures (3)

  • Figure 1: RFL-TV vs. Function Approximation Algorithms.
  • Figure 2: RFL-TV vs. OPROVI-TV (Tabular).
  • Figure 3: RFL-TV: uncertainty level $\sigma$ vs. Uniform dual-approximation error $\xi_{\mathrm{dual}}$.

Theorems & Definitions (37)

  • Definition 1: TV-Divergence Uncertainty Set
  • Definition 2: Visitation measure (ICML2025_OnlineDRMDPSampleComplexity_He)
  • Definition 3: Robust Coverability
  • Lemma 1
  • Remark 1: Relation to $\varphi$-regularized RMDPs (Arxiv2024_ModelFreeRobustRL_Panaganti)
  • Theorem 1
  • Remark 2
  • Corollary 1: Sample Complexity
  • Remark 3
  • Remark 4: Tabular
  • ...and 27 more