The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, Yuejie Chi

TL;DR

This work analyzes how distributional robustness affects data efficiency in reinforcement learning when a generative model is available. Studying RMDPs whose uncertainty sets are defined by the TV distance or the $\chi^2$ divergence, the authors analyze a model-based algorithm, distributionally robust value iteration (DRVI), and provide near-optimal upper and lower bounds on sample complexity across uncertainty levels. A nuanced picture emerges: TV-based robustness makes learning easier than, or comparable to, learning a standard MDP, while $\chi^2$-based robustness typically increases sample complexity, especially for large uncertainty levels. The results extend to offline RL with uniform data coverage and show that the geometry of the uncertainty set determines the statistical cost of robustness. Overall, robust RL is not universally harder or easier to learn than standard RL; the cost depends on the chosen divergence and on the size and shape of the uncertainty set, which has concrete implications for when to prioritize robustness in practice.

Abstract

This paper investigates model robustness in reinforcement learning (RL) to reduce the sim-to-real gap in practice. We adopt the framework of distributionally robust Markov decision processes (RMDPs), aimed at learning a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. Despite recent efforts, the sample complexity of RMDPs remained mostly unsettled regardless of the uncertainty set in use. It was unclear if distributional robustness bears any statistical consequences when benchmarked against standard RL. Assuming access to a generative model that draws samples based on the nominal MDP, we provide a near-optimal characterization of the sample complexity of RMDPs when the uncertainty set is specified via either the total variation (TV) distance or chi-squared divergence. The algorithm studied here is a model-based method called distributionally robust value iteration, which is shown to be near-optimal for the full range of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs. The statistical consequence incurred by the robustness requirement depends heavily on the size and shape of the uncertainty set: in the case w.r.t. the TV distance, the minimax sample complexity of RMDPs is always smaller than that of standard MDPs; in the case w.r.t. the chi-squared divergence, the sample complexity of RMDPs far exceeds the standard MDP counterpart.
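
To make the model-based approach concrete, the following is a minimal sketch of robust value iteration with a TV uncertainty set, not the authors' implementation. It assumes the TV distance between two distributions is half their $\ell_1$ distance, and it solves the inner worst-case problem with a greedy mass-shifting rule rather than whatever formulation the paper uses. The function names, the toy MDP, and the iteration count are hypothetical. In the setting of the abstract, `P` would be an empirical estimate of the nominal kernel built from generative-model samples; here it is simply given.

```python
import numpy as np


def worst_case_expectation_tv(p, v, sigma):
    """Return min of q @ v over distributions q with 0.5 * ||q - p||_1 <= sigma."""
    q = np.asarray(p, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    lowest = int(np.argmin(v))
    budget = float(sigma)
    # Greedy solution of the inner linear program: take mass from the
    # highest-value states first and move it onto the lowest-value state.
    for s in np.argsort(v)[::-1]:
        if budget <= 0.0:
            break
        if s == lowest:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lowest] += moved
        budget -= moved
    return float(q @ v)


def drvi_tv(P, r, gamma, sigma, num_iters=500):
    """Robust value iteration; P has shape (S, A, S), r has shape (S, A)."""
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)
        # Robust Bellman update: reward plus discounted worst-case next value.
        Q = np.array([[r[s, a] + gamma * worst_case_expectation_tv(P[s, a], V, sigma)
                       for a in range(A)] for s in range(S)])
    return Q, Q.argmax(axis=1)


# Toy nominal MDP (made up for illustration): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 0.1],
              [1.0, 0.8]])

Q_nominal, _ = drvi_tv(P, r, gamma=0.9, sigma=0.0)   # sigma = 0: standard value iteration
Q_robust, policy = drvi_tv(P, r, gamma=0.9, sigma=0.2)
```

With `sigma=0.0` the update reduces to standard value iteration, so comparing `Q_nominal` with `Q_robust` shows how much value the worst-case adversary removes at a given uncertainty level.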

Paper Structure (136 sections, 24 theorems, 272 equations, 5 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

Let the uncertainty set be $\mathcal{U}_\rho^\sigma(\cdot) = \mathcal{U}^{\sigma}_{\mathsf{TV}}(\cdot)$, as specified by the TV distance. Consider any discount factor $\gamma \in \left[\frac{1}{4},1 \right)$, uncertainty level $\sigma\in (0,1)$, and $\delta \in (0,1)$. Let [...] for any $\varepsilon \in \left(0, \sqrt{1/\max\{1-\gamma, \sigma\}} \right]$, as long as the total [...]

Figures (5)

  • Figure 1: Illustrations of the obtained sample complexity upper and lower bounds for learning RMDPs with comparisons to state-of-the-art and the sample complexity of standard MDPs, where the uncertainty set is specified using the TV distance (a) and the $\chi^2$ divergence (b).
  • Figure 2: Distributionally robust value iteration (DRVI) for infinite-horizon RMDPs.
  • Figure 3: The constructed hard robust MDP instance for the lower bound.
  • Figure 4: Illustrations of the considered MDP.
  • Figure 5: Sample complexity of DRVI with an uncertainty set under the TV distance (a-b) and the $\chi^2$ divergence (c-d), with respect to the effective horizon $1/(1-\gamma)$ and $\sigma$.

Theorems & Definitions (28)

  • Theorem 1: Upper bound under TV distance
  • Remark 1
  • Theorem 2: Lower bound under TV distance
  • Theorem 3: Upper bound under $\chi^2$ divergence
  • Remark 2
  • Theorem 4: Lower bound under $\chi^2$ divergence
  • Lemma 3: $\gamma$-Contraction
  • Lemma 4
  • Remark 3: asymmetry in error decomposition
  • Corollary 1: Upper bound under TV distance
  • ...and 18 more