ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning

Debamita Ghosh, George K. Atia, Yue Wang

TL;DR

This paper addresses online distributionally robust reinforcement learning (DRRL), where the deployment environment may differ from the training environment. It introduces RVI-$f$, a model-based, optimistic robust planning framework that requires neither a generative model nor offline data and supports general $f$-divergence ambiguity sets, including $\chi^2$ and KL. The authors establish sublinear regret and near-optimal sample complexity for the online robust objective, and prove minimax lower bounds showing near-optimality under these uncertainty sets. Experiments on the Gambler’s problem and Frozen Lake under distributional shift show improved worst-case performance. Overall, the work advances online DRRL by removing strong coverage assumptions and delivering data-efficient, theoretically grounded algorithms backed by empirical evidence of robustness.
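
For orientation, an $f$-divergence ambiguity set and the robust value it induces are commonly written as follows. This is a sketch in standard notation, not necessarily the paper's: the symbols $P^{o}$ (nominal kernel), $\mathcal{P}^{\sigma}_{f}$ and $V^{\pi,\sigma}$ are assumptions, and normalization constants for the $\chi^2$ divergence vary across papers.

$$
\mathcal{P}^{\sigma}_{f}(s,a)=\Big\{P\in\Delta(\mathcal{S}) : D_f\big(P\,\|\,P^{o}(\cdot\mid s,a)\big)\le\sigma\Big\},
\qquad
D_f(P\,\|\,Q)=\sum_{s'}Q(s')\,f\!\Big(\frac{P(s')}{Q(s')}\Big),
$$

$$
f(t)=(t-1)^2 \ \text{for } \chi^2,
\qquad
f(t)=t\log t \ \text{for KL},
\qquad
V^{\pi,\sigma}(s)=\inf_{P\in\mathcal{P}^{\sigma}_{f}}V^{\pi,P}(s).
$$

Here $\sigma$ is the uncertainty radius, and the robust objective is to find a policy maximizing $V^{\pi,\sigma}$.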

Abstract

We investigate reinforcement learning (RL) under distributional mismatch between training and deployment: policies trained in simulators often underperform in practice because deployment conditions differ from training conditions, so reliable guarantees on real-world performance are essential. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments, which yields an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment, assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study a more practical and challenging setting: online distributionally robust RL, where the agent interacts only with a single unknown training environment while seeking policies that are robust with respect to an uncertainty set around this nominal model. We consider general $f$-divergence-based ambiguity sets, including $\chi^2$ and KL divergence balls, and design a computationally efficient algorithm that achieves sublinear regret for the robust control objective under minimal assumptions, without requiring generative or offline data access. Moreover, we establish a corresponding minimax lower bound on the regret of any online algorithm, demonstrating the near-optimality of our method. Experiments across diverse environments with model misspecification show that our approach consistently improves worst-case performance and aligns with the theoretical guarantees.
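
To make "worst-case performance over an uncertainty set" concrete, the sketch below evaluates the worst-case expectation of a value vector over a KL ball around a nominal next-state distribution. It relies on the well-known dual form $\inf_{\mathrm{KL}(P\|P_0)\le\sigma}\mathbb{E}_P[V]=\sup_{\lambda>0}\{-\lambda\log\mathbb{E}_{P_0}[e^{-V/\lambda}]-\lambda\sigma\}$ and approximates the supremum by a grid search. It is not the paper's RVI-$f$ algorithm; the function name `kl_worst_case_expectation`, the grid range, and the toy numbers are illustrative assumptions.

```python
import numpy as np


def kl_worst_case_expectation(p0, v, sigma, num_grid=2000):
    """Approximate inf over {P : KL(P || p0) <= sigma} of E_P[v].

    Uses the dual form sup_{lam > 0} -lam*log E_p0[exp(-v/lam)] - lam*sigma,
    maximized over a log-spaced grid of lam (hypothetical range 1e-4..1e4).
    Every grid point is a valid lower bound on the true worst-case value.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    support = p0 > 0                      # KL-feasible P cannot leave the support of p0
    p, vals = p0[support], v[support]
    v_min = vals.min()
    lams = np.logspace(-4, 4, num_grid)
    # Shift by v_min so every exponent is <= 0 (no overflow):
    # -lam*log E[exp(-v/lam)] = v_min - lam*log E[exp(-(v - v_min)/lam)]
    exponents = -(vals[None, :] - v_min) / lams[:, None]
    log_mgf = np.log(np.sum(p[None, :] * np.exp(exponents), axis=1))
    dual = v_min - lams * log_mgf - lams * sigma
    return float(dual.max())


if __name__ == "__main__":
    p0 = [0.5, 0.5]          # nominal next-state distribution (toy example)
    v = [0.0, 1.0]           # next-state values (toy example)
    print("nominal expectation:", float(np.dot(p0, v)))
    for sigma in (0.01, 0.1, 0.5):
        print("sigma =", sigma, "robust value:", kl_worst_case_expectation(p0, v, sigma))
```

The robust value never exceeds the nominal expectation and shrinks as the radius $\sigma$ grows, which is the sense in which it gives a lower bound on deployment performance. A production implementation would typically replace the grid with a one-dimensional concave maximization.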

Paper Structure

This paper contains 49 sections, 31 theorems, 132 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Consider the $\chi^2$ and KL divergence uncertainty sets. For any $\delta\in (0,1)$ and uncertainty radius $\sigma>0$, with probability at least $1-\delta$, the regret of our RVI-$f$ algorithm, with the corresponding bonus terms (eq:Bonus_term_chi and eq:Bonus_term_KL), can be bounded as: where $f(K)=\tilde{\mathcal{O}}(g(K))$ means $f(K)\leq c\cdot g(K)\cdot\textbf{Poly}(\log(K))$ for some constant $c$

Figures (3)

  • Figure 1: Performance comparisons for the Gambler’s problem under RMDP-$\chi^2$ ($\sigma=0.05$) and RMDP-KL ($\sigma=0.1$).
  • Figure 2: Performance comparisons for the Frozen Lake under RMDP-$\chi^2$ ($\sigma=0.05$) and RMDP-KL ($\sigma=0.1$).
  • Figure :

Theorems & Definitions (60)

  • Remark 1: Comparison between TV, $\chi^2$, and KL sets
  • Theorem 1: Regret Bound of RVI-$f$
  • Corollary 1: Sample Complexity of RVI-$f$
  • Theorem 2: Minimax Lower Bound of Online DRRL
  • Lemma 1: Bound of event $\mathcal{E}_{\chi^2}$
  • proof
  • proof
  • Lemma 1: Optimistic and pessimistic estimation of the robust values for RMDP-$\chi^2$
  • proof
  • Lemma 2: Proper bonus for RMDP-$\chi^2$ and optimistic and pessimistic value estimators
  • ...and 50 more