Distributionally Robust Constrained Reinforcement Learning under Strong Duality

Zhengfei Zhang, Kishan Panaganti, Laixi Shi, Yanan Sui, Adam Wierman, Yisong Yue

TL;DR

This work develops an algorithmic framework based on strong duality that enables the first efficient and provable solution in a class of environmental uncertainties and exposes an inherent structure of DRC-RL that arises from the combination of distributional robustness and constraints.

Abstract

We study the problem of Distributionally Robust Constrained RL (DRC-RL), where the goal is to maximize the expected reward subject to environmental distribution shifts and constraints. This setting captures situations where training and testing environments differ, and policies must satisfy constraints motivated by safety or limited budgets. Despite significant progress toward algorithm design for the separate problems of distributionally robust RL and constrained RL, there do not yet exist algorithms with end-to-end convergence guarantees for DRC-RL. We develop an algorithmic framework based on strong duality that enables the first efficient and provable solution in a class of environmental uncertainties. Further, our framework exposes an inherent structure of DRC-RL that arises from the combination of distributional robustness and constraints, which prevents a popular class of iterative methods from tractably solving DRC-RL, despite such frameworks being applicable for each of distributionally robust RL and constrained RL individually. Finally, we conduct experiments on a car racing benchmark to evaluate the effectiveness of the proposed algorithm.
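The problem statement above can be made concrete with a small sketch. The code below is only an illustration, not the paper's algorithm: it evaluates a fixed policy on a hypothetical two-state, two-action MDP under a finite set of candidate transition kernels, then takes the worst case for both the reward and a constraint signal. The names `evaluate`, `worst_case_value`, `utility`, and `threshold`, the kernel values, and the "value at least threshold" form of the constraint are all assumptions made for this example.

```python
# Hypothetical toy setup: 2 states, 2 actions, a finite uncertainty set.
GAMMA = 0.9


def evaluate(policy, kernel, signal, gamma=GAMMA, iters=500):
    """Iterative policy evaluation for one transition kernel.

    policy[s][a]    : probability of taking action a in state s
    kernel[s][a][t] : probability of moving from s to t under action a
    signal[s][a]    : per-step reward (or constraint utility)
    """
    n_states = len(kernel)
    values = [0.0] * n_states
    for _ in range(iters):
        values = [
            sum(
                policy[s][a]
                * (signal[s][a]
                   + gamma * sum(kernel[s][a][t] * values[t]
                                 for t in range(n_states)))
                for a in range(len(policy[s]))
            )
            for s in range(n_states)
        ]
    return values


def worst_case_value(policy, kernels, signal, start=0, gamma=GAMMA):
    """Smallest start-state value over a finite set of candidate kernels."""
    return min(evaluate(policy, k, signal, gamma)[start] for k in kernels)


# Assumed nominal (training) kernel and one shifted (testing) kernel.
nominal = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]
shifted = [[[0.6, 0.4], [0.5, 0.5]], [[0.4, 0.6], [0.3, 0.7]]]
kernels = [nominal, shifted]

reward = [[1.0, 0.0], [0.0, 0.5]]    # objective signal (assumed)
utility = [[0.0, 1.0], [1.0, 1.0]]   # constraint signal (assumed)
policy = [[0.5, 0.5], [0.5, 0.5]]    # uniform random policy
threshold = 4.0                      # assumed constraint level

robust_reward = worst_case_value(policy, kernels, reward)
robust_utility = worst_case_value(policy, kernels, utility)
# The policy is robustly feasible if the constraint holds under every kernel.
is_feasible = robust_utility >= threshold
```

Enumerating kernels only works because the uncertainty set here is a short list; the paper's setting concerns a class of uncertainty sets and a search over policies, which this sketch does not attempt.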

Paper Structure

This paper contains 37 sections, 14 theorems, 68 equations, 4 figures, and 2 algorithms.

Key Result

Proposition 3.1

When $\Pi$ is replaced by its convex hull $\mathrm{Conv}(\Pi)$ in the conservative form of the DRC-RL problem, strong duality holds provided that Slater's condition holds.
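
As a rough illustration of what this statement asserts, consider a generic robust constrained problem. The symbols $V_r$, $V_c$, $\mathcal{P}$, $b$, and $\lambda$ below are assumed notation for this sketch and need not match the paper's definitions:

$$
\max_{\pi \in \mathrm{Conv}(\Pi)} \; \inf_{P \in \mathcal{P}} V_r^{\pi, P}
\quad \text{s.t.} \quad \inf_{P \in \mathcal{P}} V_c^{\pi, P} \ge b .
$$

With the Lagrangian $L(\pi, \lambda) = \inf_{P \in \mathcal{P}} V_r^{\pi, P} + \lambda \left( \inf_{P \in \mathcal{P}} V_c^{\pi, P} - b \right)$, strong duality means the order of optimization can be exchanged:

$$
\max_{\pi \in \mathrm{Conv}(\Pi)} \; \min_{\lambda \ge 0} L(\pi, \lambda)
\;=\;
\min_{\lambda \ge 0} \; \max_{\pi \in \mathrm{Conv}(\Pi)} L(\pi, \lambda) .
$$

In this notation, Slater's condition asks for a strictly feasible policy, that is, some $\pi$ with $\inf_{P \in \mathcal{P}} V_c^{\pi, P} > b$.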

Figures (4)

  • Figure 1: The four bar graphs show constraint satisfaction (green means satisfied) under shifts in power, inertia, braking magnitude, and steering angle. The lower-right panel shows the value of the objective (higher is better) when the steering angle is shifted. All evaluations are based on the value function (accumulated rewards) of the mixture policy $\hat{\pi}$.
  • Figure 2: A two-state, two-action Markov decision process used in the example; the left and right panels show the transition probabilities for actions $a_0$ and $a_1$, respectively.
  • Figure 3: Car Racing environment
  • Figure 4: Full results under four different shifts (higher is better). The left two panels are the constraints and the right one is the objective. The constraint-satisfaction bar graphs are produced directly from these results.

Theorems & Definitions (24)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Theorem 3.5
  • Definition 5.1: Greedy Policy Enabling
  • Definition 5.2: Operator Linearity
  • Lemma 5.3
  • Theorem 5.4
  • Corollary 5.5
  • Proposition A.1: Lemma 2 of Scherrer et al. (2015)
  • ...and 14 more