Time-Constrained Robust MDPs

Adil Zouitine, David Bertoin, Pierre Clavier, Matthieu Geist, Emmanuel Rachelson

TL;DR

This work introduces Time-Constrained Robust MDPs (TC-RMDPs) to address the conservatism of traditional robust RL that relies on $sa$-rectangular uncertainty. By parameterizing transition dynamics with a vector $\psi$ and constraining its temporal evolution with a Lipschitz bound $L$, the authors formulate a time-aware robust MDP and develop three algorithms (Oracle-TC, Stacked-TC, and Vanilla-TC) that integrate into robust value iteration via a TC Bellman operator. The framework yields policies that are robust without being overly conservative, as demonstrated on MuJoCo benchmarks, and comes with theoretical guarantees: the TC Bellman operators are contractions, and the robust objective satisfies Lipschitz bounds. Empirically, TC-RMDP variants outperform standard deep robust RL methods and domain randomization in both time-constrained and static settings, offering a practical path to robust real-world RL under temporally coupled disturbances. The results advance robust RL by balancing performance and robustness under realistic, time-evolving uncertainties.
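As a rough illustration of the temporal constraint on $\psi$ described above, the sketch below caps how far the parameter vector can move between consecutive steps. It assumes a Euclidean norm and a box-shaped parameter set; the function names `constrained_step` and `rollout`, the bounds `psi_min` and `psi_max`, and the random proposals standing in for an adversary are all hypothetical and not taken from the paper.

```python
import numpy as np


def constrained_step(psi_prev, psi_proposed, L, psi_min, psi_max):
    """Move from psi_prev toward psi_proposed by at most L, then clip to the box.

    Assumes psi_prev already lies inside [psi_min, psi_max]; clipping is then
    a projection onto a convex set containing psi_prev, so the step length
    stays at most L.
    """
    delta = psi_proposed - psi_prev
    norm = np.linalg.norm(delta)
    if norm > L:
        # Rescale the proposed move so its Euclidean length equals L.
        delta = delta * (L / norm)
    return np.clip(psi_prev + delta, psi_min, psi_max)


def rollout(psi0, L, psi_min, psi_max, steps, seed=0):
    """Generate a parameter trajectory from random (hypothetical) proposals."""
    rng = np.random.default_rng(seed)
    psis = [np.asarray(psi0, dtype=float)]
    for _ in range(steps):
        # Stand-in for an adversary: propose any point in the parameter box.
        proposal = rng.uniform(psi_min, psi_max)
        psis.append(constrained_step(psis[-1], proposal, L, psi_min, psi_max))
    return np.stack(psis)


trajectory = rollout(
    psi0=[0.5, 0.5], L=0.1, psi_min=np.zeros(2), psi_max=np.ones(2), steps=50
)
# Length of each move between consecutive parameter vectors.
step_sizes = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
```

Every entry of `step_sizes` is bounded by `L`, which is the sense in which consecutive parameters are temporally coupled. A learned adversary would replace the random proposals while keeping the same constraint.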

Abstract

Robust reinforcement learning is essential for deploying reinforcement learning algorithms in real-world scenarios where environmental uncertainty predominates. Traditional robust reinforcement learning often depends on rectangularity assumptions, where adverse probability measures of outcome states are assumed to be independent across different states and actions. This assumption, rarely fulfilled in practice, leads to overly conservative policies. To address this problem, we introduce a new time-constrained robust MDP (TC-RMDP) formulation that considers multifactorial, correlated, and time-dependent disturbances, thus more accurately reflecting real-world dynamics. This formulation goes beyond the conventional rectangularity paradigm, offering new perspectives and expanding the analytical framework for robust RL. We propose three distinct algorithms, each using varying levels of environmental information, and evaluate them extensively on continuous control benchmarks. Our results demonstrate that these algorithms yield an efficient tradeoff between performance and robustness, outperforming traditional deep robust RL methods in time-constrained environments while preserving robustness in classical benchmarks. This study revisits the prevailing assumptions in robust RL and opens new avenues for developing more practical and realistic RL applications.

Paper Structure

This paper contains 26 sections, 2 theorems, 28 equations, 15 figures, 16 tables, 2 algorithms.

Key Result

Theorem 2.1

The time-constrained (TC) Bellman operators $T^\pi_B$ and $T^*_B$ are contraction mappings. Thus the sequences $v_{n+1} = T^\pi_B v_n$ and $v_{n+1} = T^*_B v_n$ converge to their respective fixed points $v^\pi_B$ and $v^*_B$.
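The convergence claim can be checked numerically on a toy problem. The sketch below is a minimal tabular illustration, not the paper's implementation: it assumes a one-dimensional parameter discretized on a grid, an adversary that may move to any grid value within distance $L$ of the current one, and an optimal operator of the form $\max_a \min_{\psi'} \mathbb{E}[r + \gamma v(s', \psi')]$. The name `tc_bellman_optimal`, the random kernels `P`, and the rewards `r` are hypothetical.

```python
import numpy as np


def tc_bellman_optimal(v, P, r, neighbors, gamma):
    """Apply one assumed TC optimal Bellman backup on a tabular problem.

    v: (S, K) values over state and parameter index.
    P: (K, S, A, S) transition kernels, one per parameter value.
    r: (S, A) rewards.
    neighbors[k]: parameter indices reachable from index k in one step.
    """
    S, K = v.shape
    new_v = np.empty_like(v)
    for k in range(K):
        nb = neighbors[k]
        # q[j, s, a] = r[s, a] + gamma * sum_t P[nb[j], s, a, t] * v[t, nb[j]]
        q = r[None] + gamma * np.einsum("jsat,tj->jsa", P[nb], v[:, nb])
        # Adversary minimizes over reachable parameters, agent maximizes over actions.
        new_v[:, k] = q.min(axis=0).max(axis=1)
    return new_v


S, A, K, gamma, L = 4, 3, 5, 0.9, 0.25
rng = np.random.default_rng(0)
P = rng.random((K, S, A, S))
P /= P.sum(axis=-1, keepdims=True)  # normalize rows into probability distributions
r = rng.random((S, A))

grid = np.linspace(0.0, 1.0, K)  # discretized parameter values
neighbors = [np.flatnonzero(np.abs(grid - grid[k]) <= L + 1e-12) for k in range(K)]

v = np.zeros((S, K))
gaps = []  # sup-norm distance between successive iterates
for _ in range(200):
    new_v = tc_bellman_optimal(v, P, r, neighbors, gamma)
    gaps.append(np.max(np.abs(new_v - v)))
    v = new_v
```

Under these assumptions, each entry of `gaps` is at most `gamma` times the previous one, which is the geometric decay expected from a contraction with modulus `gamma`.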

Figures (15)

  • Figure 1: TC-RMDP training involves a temporally constrained adversary aiming to maximize the effect of temporally coupled perturbations. Conversely, the agent aims to optimize its performance against this time-constrained adversary. The oracle observation is shown in orange and the stacked observation in blue.
  • Figure 2: Evaluation against a random fixed adversary, with a radius $L=0.1$
  • Figure 3: Actor-critic neural network architecture
  • Figure 4: Episodic reward of the trained agent during the training of the adversary across different environments. Each plot represents the performance over 5 million timesteps, with rewards averaged across 10 seeds. The perturbation radius is set to $L=0.001$ for all adversaries.
  • Figure 5: Averaged training curves for the Domain Randomization method over 10 seeds
  • ...and 10 more figures

Theorems & Definitions (8)

  • Theorem 2.1
  • Definition 6.1: Reward/Kernel Lipschitz TC-RMDPs [lecarpentier2019non]
  • Theorem 6.2
  • proof
  • Definition C.1: Lipschitz of sequence of MDPs
  • Definition C.2
  • Definition C.3: Robust (optimal) Return of NS-RMDPs
  • proof: Proof of Theorem \ref{th:lipchitz}