Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel

Kaixin Wang, Uri Gadot, Navdeep Kumar, Kfir Levy, Shie Mannor

TL;DR

The paper tackles robustness in reinforcement learning under transition perturbations by reframing robust MDPs as a problem of estimating the worst transition kernel within a KL-based uncertainty set. The proposed EWoK approach keeps any standard non-robust RL algorithm intact while approximately sampling next states from the worst kernel, using a theoretical link $P^\pi_{\mathcal{P}}(s'|s,a) = \bar{P}^\pi(s'|s,a)\, e^{-\delta^\pi(s')}$ and an efficient approximation $\hat{\delta}^\pi(s')$ derived from current value estimates. The method is proven to converge toward the true worst kernel and is demonstrated on tasks from Cartpole to the DeepMind Control Suite, showing improved robustness to perturbations compared with non-robust baselines and domain randomization. Its plug-and-play nature with any off-the-shelf RL algorithm enables scalable robust learning in high-dimensional domains, offering a practical pathway for deploying robust policies in real-world settings. Limitations include the need to repeatedly sample next states from the transition model, suggesting future work on integrating world models and offline or model-based variants to further reduce compounding errors.

Abstract

Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, current RMDP methods are often limited to small-scale problems, hindering their use in high-dimensional domains. To bridge this gap, we present EWoK, a novel online approach to solve RMDPs that Estimates the Worst transition Kernel to learn robust policies. Unlike previous works that regularize the policy or value updates, EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process. Notably, EWoK can be applied on top of any off-the-shelf non-robust RL algorithm, enabling easy scaling to high-dimensional domains. Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of the EWoK paradigm as a practical method for learning robust policies.
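
The idea of simulating the worst scenarios by resampling next states can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that several candidate next states are drawn from the nominal environment for the same state-action pair, and that the approximate tilt is the current value estimate divided by a temperature. The function name `resample_next_state` and the parameter `kappa` are hypothetical.

```python
import numpy as np

def resample_next_state(candidates, values, kappa, rng):
    """Pick one candidate next state with probability proportional to
    exp(-v(s') / kappa), so low-value candidates are favored.

    Assumptions (not from the source): `candidates` were sampled from the
    nominal kernel, `values` are the agent's current value estimates for
    them, and `kappa` is a hypothetical temperature.
    """
    values = np.asarray(values, dtype=float)
    logits = -values / kappa
    logits -= logits.max()          # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()            # normalize into a distribution
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx], probs

rng = np.random.default_rng(0)
candidates = ["s1", "s2", "s3", "s4"]   # placeholder next states
values = [1.0, 0.2, 0.8, 0.5]           # placeholder value estimates
next_state, probs = resample_next_state(candidates, values, kappa=0.5, rng=rng)
print(next_state, probs.round(3))
```

Under these assumptions the non-robust learner is untouched: it simply receives `next_state` in place of the nominal sample. A large `kappa` leaves the candidates nearly uniformly weighted, while a small one concentrates the choice on the lowest-value candidate.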

Paper Structure

This paper contains 29 sections, 13 theorems, 63 equations, 13 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.2

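For a KL uncertainty set $\mathcal{P}$ and a policy $\pi$, a worst kernel is related to the nominal kernel through $P^\pi_{\mathcal{P}}(s'|s,a) = \bar{P}^\pi(s'|s,a)\, e^{-\delta^\pi(s')}$, where $\delta^\pi$ is of the form [equation not captured] and satisfies [equation not captured].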
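
A small tabular sketch may help make the exponential relation concrete. It assumes, purely for illustration, that the tilt is $\delta(s') = v(s')/\kappa$ up to an additive normalization constant, with a hypothetical temperature `kappa`. The helper names `tilt_kernel` and `kl` are likewise not from the paper. The sketch tilts one row of a nominal kernel and reports the expected value and the KL divergence from the nominal row.

```python
import numpy as np

def tilt_kernel(p_bar, v, kappa):
    """Return a distribution proportional to p_bar(s') * exp(-v(s') / kappa).

    The explicit renormalization stands in for the normalization term
    that the tilt is assumed to absorb (illustrative assumption).
    """
    p_bar = np.asarray(p_bar, dtype=float)
    v = np.asarray(v, dtype=float)
    w = p_bar * np.exp(-(v - v.min()) / kappa)  # shifted for stability
    return w / w.sum()

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_bar = np.array([0.5, 0.3, 0.2])   # nominal next-state probabilities
v = np.array([1.0, 0.0, 2.0])       # placeholder value estimates

for kappa in (10.0, 1.0, 0.1):
    p = tilt_kernel(p_bar, v, kappa)
    print(kappa, p.round(3), round(float(p @ v), 3), round(kl(p, p_bar), 3))
```

In this toy setting, lowering `kappa` moves probability mass toward low-value next states, which lowers the expected value and increases the KL divergence from the nominal row. This is one way to see why a KL-bounded adversary takes an exponentially tilted form.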

Figures (13)

  • Figure 1: The agent-environment interaction loop during training. Left: Existing methods typically regularize how an agent updates its policy to improve robustness. Right: Our work estimates a worst transition kernel, so the agent essentially learns its policy under the worst scenarios and can use any non-robust RL algorithm.
  • Figure 2: An illustration of how next states are sampled in the estimated worst kernel.
  • Figure 3: Cliff-Walking environment and experiment results. In the bottom 3 plots, the color indicates the learned value and the arrows indicate the actions under the policy.
  • Figure 4: An illustration of the experimental setting. Grey earth denotes the unperturbed (nominal) environment while colored earths denote perturbed environments.
  • Figure 5: Evaluation results on Cartpole with noise and environment parameters perturbations for both DDQN and PPO algorithms.
  • ...and 8 more figures

Theorems & Definitions (23)

  • Definition 3.1
  • Theorem 3.2
  • Proposition 3.2
  • Proposition 3.2
  • Theorem 3.3
  • Proposition A.1
  • proof
  • Theorem A.1
  • proof
  • Lemma A.2
  • ...and 13 more