On the Foundation of Distributionally Robust Reinforcement Learning

Shengbo Wang; Nian Si; Jose Blanchet; Zhengyuan Zhou

TL;DR

This work builds a unified distributionally robust reinforcement learning (DRRL) framework by formalizing robust MDPs (RMDPs) as controller/adversary interactions under varied information structures (history-dependent, Markov, Markov time-homogeneous) and rectangularity assumptions (SA and S). It carries out a comprehensive dynamic programming analysis, establishing when the dynamic programming principle (DPP) holds and proving the corresponding robust Bellman equations, while also presenting counterexamples that reveal DPP failure under non-convex adversaries or certain policy classes. The authors introduce an asymptotically optimal history-dependent policy and connect DRRL to MSPs and stochastic games, providing guidance for model design and algorithm development. The results clarify when Markov time-homogeneous policies are sufficient and when richer, history-aware strategies are required to hedge against adversarial shifts in the deployment environment.

Abstract

Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around robust Markov decision processes (RMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct RMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the structure of information availability (covering history-dependent, Markov, and Markov time-homogeneous dynamics) as well as constraints on the shifts induced by the adversary, with a focus on SA- and S-rectangularity. Within this RMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of the DPP holds significant implications, as the vast majority of existing data- and computationally efficient DRRL algorithms rely on the DPP. To investigate its existence, we systematically analyze various combinations of controller and adversary attributes, presenting streamlined proofs based on a unified methodology. We then construct counterexamples for settings where a fully general DPP fails to hold and establish asymptotically optimal history-dependent policies for key scenarios where the DPP is absent.
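
To make the role of the DPP concrete, the following is a minimal sketch of robust value iteration in a setting where a robust Bellman recursion can be applied: a finite, discounted RMDP with an SA-rectangular adversary. It is an illustration only, not an algorithm taken from the paper. The function name `robust_value_iteration`, the array layout, and the assumption that the adversary chooses among a finite list of candidate transition kernels independently for each state-action pair are all hypothetical.

```python
import numpy as np


def robust_value_iteration(rewards, kernels, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Sketch of robust value iteration (assumed SA-rectangular, finite candidates).

    rewards: array of shape (S, A), reward for each state-action pair.
    kernels: array of shape (K, S, A, S); kernels[k, s, a] is the k-th
             candidate next-state distribution at (s, a). The adversary may
             pick a different k for every (s, a), which is the SA-rectangular
             assumption made in this sketch.
    Returns the robust value vector (S,) and a greedy deterministic policy (S,).
    """
    n_states, _ = rewards.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Expected next-state value under every candidate: shape (K, S, A).
        expected = kernels @ v
        # Adversary minimizes separately at each (s, a).
        worst = expected.min(axis=0)
        # Controller maximizes over actions.
        v_new = (rewards + gamma * worst).max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    # Greedy policy with respect to the final robust Q-values.
    q = rewards + gamma * (kernels @ v).min(axis=0)
    return v, q.argmax(axis=1)


# Toy instance (hypothetical numbers): state 1 is absorbing with zero reward.
# In state 0, action 0 pays more, but the adversary can redirect it to state 1.
rewards = np.array([[1.0, 0.8],
                    [0.0, 0.0]])
nominal = np.array([[[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 1.0], [0.0, 1.0]]])
shifted = np.array([[[0.0, 1.0], [1.0, 0.0]],
                    [[0.0, 1.0], [0.0, 1.0]]])
kernels = np.stack([nominal, shifted])

robust_v, robust_pi = robust_value_iteration(rewards, kernels, gamma=0.5)
nominal_v, nominal_pi = robust_value_iteration(rewards, kernels[:1], gamma=0.5)
```

In this toy instance the nominal model favors action 0 in state 0, while the robust recursion switches to action 1, whose transition the adversary cannot alter. Because the expected value is linear in the transition distribution, minimizing over the finite candidate list gives the same result as minimizing over its convex hull.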

Paper Structure (52 sections, 21 theorems, 144 equations, 3 figures, 2 tables, 3 algorithms)

Introduction
Results and Methodology
Literature Review
Remark on Paper Organization
Robust MDPs: Construction and Definitions
Controller's Policy
Adversary's Policy
SA-Rectangular Set of Adversary's Policies
S-Rectangular Set of Adversary's Policies
General Rectangular Set of Adversary's Policies
Adversary's Policy: Summary of Notations
The Max-Min Control Problem
Dynamic Programming Principles
Max-Min Optimal Values and Bellman Equation: The General Case
Deterministic Controller Policies
...and 37 more sections

Key Result

Lemma 1

Recall the definitions of the controller's and adversary's policy classes. Then, for any controller action set $\mathcal{Q}$ and S-rectangular adversary action set $\mathcal{P}^{\mathrm{S}}$, $v(\mu,\Pi_{\mathrm{H}},\mathrm{K})\geq v(\mu,\Pi_{\mathrm{M}},\mathrm{K})\geq v(\mu,\Pi_{\mathrm{S}},\mathr

Figures (3)

Figure 1: The adversarial actions in the adversary's action distribution set, where the red line and the blue line represent actions $a_1$ and $a_2$, respectively.
Figure 2: The adversarial actions in the action distribution set, where the red line and the blue line represent actions $a_1$ and $a_2$, respectively.
Figure 3: The extreme point of adversarial actions in the action distribution set, where the red line and the blue line represent actions $a_1$ and $a_2$, respectively.

Theorems & Definitions (58)

Example 1: Simple Inventory Model
Definition 1: Induced Probability Measure
Definition 2: Max-Min Control Value
Lemma 1
Definition 3: Robust Bellman Equation
Proposition 1
Definition 4: Dynamic Programming Principle
Theorem 1
Definition 5
Corollary 1.1
...and 48 more
