Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation
Debamita Ghosh, George K. Atia, Yue Wang
TL;DR
The paper addresses the deployment of RL under environmental shift by formulating online distributionally robust RL with a total variation (TV) uncertainty set and general function approximation. It introduces Robust Fitted Learning with TV-Divergence Uncertainty Set (RFL-TV), a dual-robust operator-based method that uses global confidence sets and a dual network to drive exploration. The theoretical contributions are a sublinear regret bound and a sample complexity bound that depend on a robust coverability constant, which lets the guarantees scale to large state-action spaces, and that are near-optimal in the linear TV-RMDP setting, together with a novel dual-optimization framework for robust Bellman equations. Experiments on CartPole show strong robustness to action and dynamics perturbations, support the computational viability of online DR-RL with function approximation, and compare favorably with tabular and non-robust baselines across a range of perturbations.
Abstract
The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge, such as access to a generative model or a large offline dataset, and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.
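
To make the worst-case objective over a total variation uncertainty set concrete, the following is a minimal tabular sketch, not the RFL-TV algorithm itself. It assumes a finite next-state space, a nominal distribution `p0`, a value vector `v`, a radius `sigma`, and the convention TV(P, p0) = 0.5 * ||P - p0||_1 with the adversary free to place mass on any state. The function names are hypothetical. The dual form used as a cross-check is the one commonly cited in the robust RL literature and is assumed here; it is not necessarily the exact operator defined in the paper.

```python
def worst_case_expectation_tv(p0, v, sigma):
    """Primal: minimize E_P[v] over all P with 0.5 * ||P - p0||_1 <= sigma.

    The minimizer moves up to `sigma` probability mass from the
    highest-value states onto the single lowest-value state.
    """
    p = list(p0)
    lowest = min(range(len(v)), key=lambda i: v[i])
    budget = sigma
    for i in sorted(range(len(v)), key=lambda i: v[i], reverse=True):
        if budget <= 0.0:
            break
        if i == lowest:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    return sum(pi * vi for pi, vi in zip(p, v)), p


def worst_case_expectation_tv_dual(p0, v, sigma):
    """Dual (assumed form): max over alpha of
    E_p0[min(v, alpha)] - sigma * (alpha - min_s min(v(s), alpha)).

    The objective is piecewise linear and concave in alpha, so it is
    enough to evaluate it at the entries of v.
    """
    best = float("-inf")
    for alpha in v:
        clipped = [min(x, alpha) for x in v]
        value = sum(pi * ci for pi, ci in zip(p0, clipped))
        value -= sigma * (alpha - min(clipped))
        best = max(best, value)
    return best


# Illustrative numbers only (not taken from the paper).
p0 = [0.5, 0.3, 0.2]
v = [1.0, 2.0, 4.0]
sigma = 0.25

nominal = sum(pi * vi for pi, vi in zip(p0, v))            # about 1.9
robust_primal, worst_p = worst_case_expectation_tv(p0, v, sigma)
robust_dual = worst_case_expectation_tv_dual(p0, v, sigma)
print(nominal, robust_primal, robust_dual, worst_p)
```

In this toy instance the adversary shifts 0.25 of probability mass toward the lowest-value state, so the robust value (about 1.25) falls below the nominal value (about 1.9), and the primal and dual computations agree. A scalar dual variable of this kind is what makes the worst-case expectation tractable when the value function is represented by a function approximator instead of a table.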
