Solving robust MDPs as a sequence of static RL problems

Adil Zouitine, Matthieu Geist, Emmanuel Rachelson

TL;DR

This work revisits robustness in reinforcement learning through the static model of transition uncertainty, which is equivalent to the dynamic model for stationary policies and sa-rectangular uncertainty sets when there is no duality gap. It introduces IWOCS (Incremental Worst-Case Search), a meta-algorithm that progressively expands a discrete uncertainty set by solving a sequence of standard MDPs in order to approximate the robust policy. A Deep-IWOCS variant combines SAC for policy optimization with CMA-ES for worst-case transition search, and includes mechanisms such as predictive-coding indicators to handle partial state coverage. On classical benchmarks, IWOCS and Deep-IWOCS achieve competitive or superior worst-case and average performance compared to state-of-the-art robust RL methods, which points to a new direction for decoupling policy optimization from adversarial dynamics. The work also identifies open theoretical and practical questions, such as convergence guarantees under approximation, gradient-based worst-case search, and the impact of uncertainty-set design on robustness and efficiency.

Abstract

Designing control policies whose performance level is guaranteed to remain above a given threshold in a span of environments is a critical feature for the adoption of reinforcement learning (RL) in real-world applications. The search for such robust policies is a notoriously difficult problem, related to the so-called dynamic model of transition function uncertainty, where the environment dynamics are allowed to change at each time step. But in practical cases, one is rather interested in robustness to a span of static transition models throughout interaction episodes. The static model is known to be harder to solve than the dynamic one, and seminal algorithms, such as robust value iteration, as well as most recent works on deep robust RL, build upon the dynamic model. In this work, we propose to revisit the static model. We suggest an analysis of why solving the static model under some mild hypotheses is a reasonable endeavor, based on an equivalence with the dynamic model, and formalize the general intuition that robust MDPs can be solved by tackling a series of static problems. We introduce a generic meta-algorithm called IWOCS, which incrementally identifies worst-case transition models so as to guide the search for a robust policy. Discussion on IWOCS sheds light on new ways to decouple policy optimization and adversarial transition functions and opens new perspectives for analysis. We derive a deep RL version of IWOCS and demonstrate it is competitive with state-of-the-art algorithms on classical benchmarks.
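The incremental idea described above can be illustrated with a minimal tabular sketch. It is not the authors' implementation and rests on several assumptions: the uncertainty set is a small finite list of candidate transition models (hypothetical name `candidates`) searched by enumeration rather than CMA-ES, each standard MDP is solved exactly by value iteration rather than SAC, and the current policy acts greedily on the pointwise minimum of the Q-functions found so far. The function names `value_iteration`, `policy_value`, and `iwocs_sketch` are illustrative only.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, iters=500):
    """Optimal Q-function of one standard MDP. T: (S, A, S), R: (S, A)."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        # Bellman optimality backup; T @ v has shape (S, A).
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def policy_value(pi, T, R, gamma=0.9):
    """Exact value of a deterministic policy pi (one action per state) under T."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P = T[idx, pi]  # (S, S) transition matrix induced by pi
    r = R[idx, pi]  # (S,) reward vector induced by pi
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)

def iwocs_sketch(candidates, R, s0=0, gamma=0.9, max_rounds=10):
    """Grow a set of worst-case models, solving one standard MDP per model.

    Assumption: candidates[0] plays the role of the nominal model.
    """
    found = [0]
    q_tables = [value_iteration(candidates[0], R, gamma)]
    for _ in range(max_rounds):
        # Assumed combination rule: greedy on the pessimistic (min) Q-function.
        q_min = np.min(q_tables, axis=0)
        pi = q_min.argmax(axis=1)
        # Worst-case search by enumeration (stand-in for a black-box optimizer).
        values = [policy_value(pi, T, R, gamma)[s0] for T in candidates]
        worst = int(np.argmin(values))
        if worst in found:
            break  # no new worst-case model: stop expanding the set
        found.append(worst)
        q_tables.append(value_iteration(candidates[worst], R, gamma))
    return pi, values[worst], found
```

Calling `iwocs_sketch(candidates, R)` returns the last candidate policy, its worst-case value at the start state `s0` over the candidate set, and the indices of the models added along the way. Each round solves only a standard (non-robust) MDP, which is the decoupling the abstract refers to.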

Paper Structure (48 sections, 6 equations, 4 figures, 25 tables, 1 algorithm)

Figures (4)

  • Figure 1: Convergence to $V^*$ vs Bellman iterates (right) in the Windy walk grid-world (left).
  • Figure 2: Windy walk grid-world.
  • Figure 3: Network architectures
  • Figure 4: Counting how many policies are valid in each state, in Hopper 3