Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm

Miao Lu; Han Zhong; Tong Zhang; Jose Blanchet

TL;DR

This work introduces the vanishing minimal value assumption for RMDPs with a total-variation (TV) distance robust set, proves that the assumption effectively eliminates the support shift issue in this setting, and presents an algorithm with a provable sample complexity guarantee.

Abstract

The sim-to-real gap, which represents the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the worst-case scenario among all environments within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness and striking a balance between exploration and exploitation during data collection. First, we establish that sample-efficient learning without additional assumptions is unattainable owing to the curse of support shift, i.e., the potential disjointedness of the distributional supports between the training and testing environments. To circumvent this hardness result, we introduce the vanishing minimal value assumption to RMDPs with a total-variation (TV) distance robust set, postulating that the minimal value of the optimal robust value function is zero. We prove that this assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and present an algorithm with a provable sample complexity guarantee. Our work takes an initial step toward uncovering the inherent difficulty of robust RL via interactive data collection and toward identifying sufficient conditions for designing a sample-efficient algorithm accompanied by a sharp sample complexity analysis.
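To make the worst-case objective and the support shift issue concrete, the following is a minimal sketch of a worst-case expectation over a TV ball around a nominal next-state distribution. It is an illustration under stated assumptions, not the paper's algorithm: the function name `tv_worst_case_expectation`, the convention that the TV distance is half the L1 distance, and the toy numbers are all hypothetical choices made here.

```python
import numpy as np

def tv_worst_case_expectation(p0, v, rho):
    """Minimize sum_s p[s] * v[s] over distributions p with
    0.5 * ||p - p0||_1 <= rho (assumed TV convention, not taken from the paper).

    Greedy solution: move up to `rho` probability mass from the
    highest-value states onto a lowest-value state.
    """
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    p = p0.copy()
    s_min = int(np.argmin(v))
    # Cannot move more mass than the nominal kernel places outside s_min.
    budget = min(rho, 1.0 - p0[s_min])
    for s in np.argsort(-v):  # states in decreasing order of value
        if budget <= 0:
            break
        if s == s_min:
            continue
        moved = min(p[s], budget)
        p[s] -= moved
        p[s_min] += moved
        budget -= moved
    return float(p @ v), p

# Toy example (hypothetical numbers): state 2 is never reached under the
# nominal distribution, yet the worst-case distribution puts mass on it.
nominal = [0.6, 0.4, 0.0]
values = [1.0, 0.5, 0.0]
worst_value, worst_p = tv_worst_case_expectation(nominal, values, rho=0.2)
print(worst_value, worst_p)
```

In this toy instance the worst-case distribution assigns positive probability to a state that the nominal distribution never visits, which is the kind of support mismatch the abstract refers to. The example also has minimal value zero, so the loss from robustness is driven entirely by how much mass the adversary can remove from high-value states.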

Paper Structure (65 sections, 24 theorems, 209 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Contributions
Fundamental hardness.
Identifying a tractable case.
Efficient algorithm with sharp sample complexity.
Related Works
Robust reinforcement learning in robust Markov decision processes.
Sample-efficient online non-robust reinforcement learning.
Corruption robust reinforcement learning.
Notations
Preliminaries
Robust Markov Decision Processes
${\mathcal{S}}\times\mathcal{A}$-rectangularity and robust Bellman equations.
Total-variation distance robust set.
Robust RL with Interactive Data Collection
...and 50 more sections

Key Result

Proposition 2.2

Under the ${\mathcal{S}}\times\mathcal{A}$-rectangularity assumption, for any transition $P=\{P_h\}_{h=1}^H\subseteq\mathcal{P}$ and any policy $\pi=\{\pi_h\}_{h=1}^H$ with $\pi_h:\mathcal{S}\mapsto\Delta(\mathcal{A})$, it holds that for any $(s,a,h)\in{\mathcal{S}}\times\mathcal{A}\times[H]$,

Figures (1)

Figure 1: Illustration of the hard example in Example 3.1. The solid lines represent possible transitions of the nominal transition kernel. The dashed lines represent the transitions induced by the worst-case transition kernel in the robust set. The red solid line marks the transition on which the two RMDP instances differ: in each instance, a different action leads to the higher transition probability from $s_{\mathrm{bad}}$ to $s_{\mathrm{good}}$. Note that when starting from $s_1 = s_{\mathrm{good}}$, the nominal transition kernel keeps the agent at $s_{\mathrm{good}}$, so no information about $s_{\mathrm{bad}}$ is revealed.

Theorems & Definitions (62)

Proposition 2.2: Robust Bellman equation
Proposition 2.3: Robust Bellman optimal equation
Definition 2.4: Total-variation distance robust set
Proposition 2.5: Strong duality representation
Proof of Proposition 2.5
Remark 2.6
Proposition 2.7: Gap between maximum and minimum
Proof of Proposition 2.7
Example 3.1: Hard example of robust RL with interactive data collection
Theorem 3.2: Hardness result (based on Example 3.1)
...and 52 more
