Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm

Miao Lu, Han Zhong, Tong Zhang, Jose Blanchet

TL;DR

This work introduces the vanishing minimal value assumption for RMDPs with a total-variation (TV) distance robust set, proves that the assumption eliminates the support shift issue in this setting, and presents an algorithm with a provable sample complexity guarantee.

Abstract

The sim-to-real gap, the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the worst-case scenario among all environments within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness while striking a balance between exploration and exploitation during data collection. First, we establish that sample-efficient learning without additional assumptions is unattainable owing to the curse of support shift, i.e., the potential disjointedness of the distributional supports between the training and testing environments. To circumvent this hardness result, we introduce the vanishing minimal value assumption to RMDPs with a total-variation (TV) distance robust set, postulating that the minimal value of the optimal robust value function is zero. We prove that this assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and present an algorithm with a provable sample complexity guarantee. Our work takes an initial step toward uncovering the inherent difficulty of robust RL via interactive data collection and sufficient conditions for designing a sample-efficient algorithm accompanied by sharp sample complexity analysis.
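To make the worst-case evaluation over a TV distance robust set concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a finite state space, the convention $D_{\mathrm{TV}}(q, p) = \tfrac{1}{2}\|q - p\|_1$, and a robust set containing every distribution within radius `rho` of the nominal distribution `p`. The function name `worst_case_tv` and all variable names are hypothetical. Under these assumptions, the worst case is attained greedily: take up to `rho` mass from the highest-value next states and place it on a lowest-value state.

```python
import numpy as np

def worst_case_tv(p, v, rho):
    """Greedy worst case of q @ v over {q in simplex : 0.5 * ||q - p||_1 <= rho}.

    p   : nominal next-state distribution (assumed to sum to 1)
    v   : value of each next state
    rho : TV radius in [0, 1] (assumed convention: TV = half the L1 distance)
    Returns the worst-case distribution q and the value q @ v.
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    q = p.copy()
    target = int(np.argmin(v))      # a lowest-value state receives the moved mass
    remaining = float(rho)          # mass budget that may still be moved
    for s in np.argsort(-v):        # visit states from highest to lowest value
        if remaining <= 0.0:
            break
        if s == target:
            continue
        moved = min(q[s], remaining)
        q[s] -= moved
        q[target] += moved
        remaining -= moved
    return q, float(q @ v)

# Nominal kernel never visits state 0, yet the worst case puts mass on it.
q, value = worst_case_tv(p=[0.0, 1.0], v=[0.0, 1.0], rho=0.1)
print(q, value)
```

In the usage example the nominal distribution assigns probability zero to state 0, while the worst-case distribution assigns it probability `rho`. This is the support shift described in the abstract: interaction with the nominal environment alone never samples that state. In the example the lowest value is zero, so the shifted mass contributes nothing to the worst-case expectation, which gives some intuition for the role of the vanishing minimal value assumption.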

Paper Structure

This paper contains 65 sections, 24 theorems, 209 equations, 1 figure, 1 table, 1 algorithm.

Key Result

Proposition 2.2

Under Assumption 2.1, for any transition $P=\{P_h\}_{h=1}^H\subseteq\mathcal{P}$ and any policy $\pi=\{\pi_h\}_{h=1}^H$ with $\pi_h:\mathcal{S}\to\Delta(\mathcal{A})$, it holds that for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$,

Figures (1)

  • Figure 1: Illustration of the hard example in Example 3.1. The solid lines represent possible transitions of the nominal transition kernel. The dashed lines represent the transitions induced by the worst-case transition kernel in the robust set. The red solid line marks the transition on which the two RMDP instances differ: a different action leads to the higher transition probability from $s_{\mathrm{bad}}$ to $s_{\mathrm{good}}$. When starting from $s_1 = s_{\mathrm{good}}$, the nominal transition kernel keeps the agent at $s_{\mathrm{good}}$, so no information about $s_{\mathrm{bad}}$ is revealed.

Theorems & Definitions (62)

  • Proposition 2.2: Robust Bellman equation
  • Proposition 2.3: Robust Bellman optimal equation
  • Definition 2.4: Total-variation distance robust set
  • Proposition 2.5: Strong duality representation
  • Proof of Proposition 2.5
  • Remark 2.6
  • Proposition 2.7: Gap between maximum and minimum
  • Proof of Proposition 2.7
  • Example 3.1: Hard example of robust RL with interactive data collection
  • Theorem 3.2: Hardness result (based on Example 3.1)
  • ...and 52 more