On the Foundation of Distributionally Robust Reinforcement Learning
Shengbo Wang, Nian Si, Jose Blanchet, Zhengyuan Zhou
TL;DR
This work builds a unified framework for distributionally robust reinforcement learning (DRRL) by formalizing robust MDPs (RMDPs) as interactions between a controller and an adversary under varied information structures (history-dependent, Markov, and Markov time-homogeneous) and rectangularity assumptions (SA- and S-rectangularity). It gives a comprehensive dynamic programming analysis, establishing when the dynamic programming principle (DPP) holds and proving the corresponding robust Bellman equations, and it presents counterexamples showing that the DPP can fail under non-convex adversaries or certain policy classes. The authors introduce an asymptotically optimal history-dependent policy for settings where the DPP is absent, and they connect DRRL to MSPs and stochastic games. The results clarify when Markov time-homogeneous policies suffice and when richer, history-aware strategies are required to hedge against adversarial shifts in the deployment environment. Overall, the paper lays a rigorous theoretical foundation for DRRL and informs model design and algorithmic choices across robust learning scenarios.
Abstract
Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around robust Markov decision processes (RMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct RMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the structure of information availability (covering history-dependent, Markov, and Markov time-homogeneous dynamics) as well as constraints on the shifts induced by the adversary, with a focus on SA- and S-rectangularity. Within this RMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of the DPP has significant implications, as the vast majority of existing data- and computationally efficient DRRL algorithms rely on it. To investigate its existence, we systematically analyze various combinations of controller and adversary attributes, presenting streamlined proofs based on a unified methodology. We then construct counterexamples for settings where a fully general DPP fails to hold and establish asymptotically optimal history-dependent policies for key scenarios where the DPP is absent.
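To make the algorithmic role of the DPP concrete, the following is a minimal illustrative sketch, not taken from the paper, of robust value iteration in the classical setting where a robust Bellman equation is standard: a finite, discounted, SA-rectangular RMDP with Markov time-homogeneous controller and adversary. The function name, array layout, discount factor, and the representation of each ambiguity set by a finite list of candidate next-state distributions (read as the extreme points of a convex set) are all assumptions made for illustration.

```python
import numpy as np


def robust_value_iteration(rewards, candidates, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Hypothetical sketch of SA-rectangular robust value iteration.

    rewards:    array of shape (S, A), reward for each state-action pair.
    candidates: array of shape (S, A, K, S); candidates[s, a, k] is the k-th
                candidate next-state distribution for the pair (s, a). The
                ambiguity set for (s, a) is taken to be the convex hull of
                these K distributions (an assumption of this sketch).
    Returns the robust value function and a greedy robust policy.
    """
    n_states, _ = rewards.shape
    v = np.zeros(n_states)

    def robust_q(values):
        # Expected next-state value under every candidate: shape (S, A, K).
        expected = candidates @ values
        # SA-rectangularity: the adversary picks the worst candidate
        # independently for each (s, a). The objective is linear in the
        # distribution, so the minimum over the convex hull is attained
        # at one of the K listed candidates.
        return rewards + gamma * expected.min(axis=2)

    for _ in range(max_iter):
        v_new = robust_q(v).max(axis=1)  # controller maximizes over actions
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break

    policy = robust_q(v).argmax(axis=1)
    return v, policy


if __name__ == "__main__":
    # Two states, one action. From state 0 the adversary may keep the system
    # in state 0 or send it to the absorbing zero-reward state 1.
    rewards = np.array([[1.0], [0.0]])
    candidates = np.array([
        [[[1.0, 0.0], [0.0, 1.0]]],
        [[[0.0, 1.0], [0.0, 1.0]]],
    ])
    values, policy = robust_value_iteration(rewards, candidates, gamma=0.9)
    print(values, policy)
```

The recursion above is only meaningful as a solution method when a DPP of this form holds. In the settings the abstract identifies as lacking a DPP, the fixed point of such an iteration need not coincide with the robust optimal value, which is why the history-dependent policies mentioned above become relevant.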
