Wasserstein Robust Reinforcement Learning

Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, Jun Wang

TL;DR

This work tackles the challenge of overfitting and poor generalisation in reinforcement learning by introducing WR$^2$L, a Wasserstein-robust RL framework that seeks the best policy under worst-case but bounded transition dynamics. The core idea is to constrain admissible dynamics within an $\epsilon$-Wasserstein ball around a reference model $\mathcal{P}_0$ and to solve a min–max objective over policy parameters $\theta$ and dynamics parameters $\phi$. A key novelty is the alternating descent-ascent optimisation, with a second-order Taylor approximation of the Wasserstein distance yielding a closed-form update for $\phi$ and a gradient-based update for $\theta$, plus a zero-order method to estimate gradients and Hessians when dynamics are treated as black-box simulators. Empirically, WR$^2$L demonstrates superior robustness compared to standard and prior robust RL approaches on MuJoCo benchmarks, including high-dimensional variations, highlighting its practical impact for real-world, uncertain environments. The work also provides a scalable zero-order solver and analytic results for the Hessian-based constraint handling, offering a general tool for robust optimisation in dynamical systems.
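The closed-form update for $\phi$ mentioned above can be illustrated with a small sketch. It assumes that the adversary lowers the expected return, that the return is linearised around the reference parameters $\phi_0$ with gradient $g$, and that the Wasserstein constraint is replaced by the quadratic $\tfrac{1}{2}(\phi-\phi_0)^\top H (\phi-\phi_0) \le \epsilon$ with a positive-definite matrix $H$. Under those assumptions the minimiser of the linearised return is the standard trust-region step $\phi = \phi_0 - \sqrt{2\epsilon / (g^\top H^{-1} g)}\, H^{-1} g$. The function name and arguments are hypothetical and are not taken from the paper.

```python
import numpy as np


def worst_case_dynamics_step(phi0, grad, hess, eps):
    """Minimise grad^T (phi - phi0) s.t. 0.5 (phi - phi0)^T hess (phi - phi0) <= eps.

    Illustrative only: `grad` stands in for the gradient of the expected return
    with respect to the dynamics parameters, and `hess` for the Hessian of the
    Wasserstein term at phi0 (assumed symmetric positive definite).
    """
    phi0 = np.asarray(phi0, dtype=float)
    grad = np.asarray(grad, dtype=float)
    hess = np.asarray(hess, dtype=float)

    # Solve hess @ x = grad rather than forming the inverse explicitly.
    h_inv_g = np.linalg.solve(hess, grad)
    quad = float(grad @ h_inv_g)
    if quad <= 0.0:
        raise ValueError("hess must be positive definite and grad non-zero")

    # Scale the step so that the quadratic constraint holds with equality.
    step = np.sqrt(2.0 * eps / quad)
    return phi0 - step * h_inv_g
```

The step always lands on the boundary of the quadratic constraint, which is what one would expect when the objective is linear: the adversary spends its whole budget $\epsilon$.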

Abstract

Reinforcement learning algorithms, though successful, tend to over-fit to training environments, hampering their application to the real world. This paper proposes $\text{W}\text{R}^{2}\text{L}$, a robust reinforcement learning algorithm with significant robust performance on low- and high-dimensional control tasks. Our method formalises robust reinforcement learning as a novel min-max game with a Wasserstein constraint for a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver following a novel zero-order optimisation method that we believe can be useful to numerical optimisation in general. We empirically demonstrate significant gains compared to standard and robust state-of-the-art algorithms on high-dimensional MuJoCo environments.

Paper Structure

This paper contains 29 sections, 2 theorems, 31 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

For a fixed $\bm{\theta}$ and $\bm{\phi}$, the gradient can be computed as:

Figures (5)

  • Figure 1: Robustness results on the inverted pendulum demonstrating that our method outperforms the state of the art in terms of average test returns.
  • Figure 2: Robustness results on Hopper (left) and Walker (right) systems demonstrating that our method significantly outperforms the others in terms of average test returns as torso densities vary. Because of the robust problem formulation, our algorithm exhibits a trade-off between optimality and generalisation. Hopper results use a reference $\rho_0=1750$; PPO$_2$ uses the same implementation as PPO but is trained with $\rho_0=3000$. Walker results use a reference model with $\rho_0=1750$.
  • Figure 3: Results on various benchmarks. The top row shows Hopper results, the middle row Walker, and the bottom row HalfCheetah. The graphs depict test returns as a function of changes in dynamical parameters and of the Wasserstein distance. They again show that $\text{W}\text{R}^{2}\text{L}$ outperforms PPO (i.e., the case $\epsilon =0$) and that its robustness improves as $\epsilon$ increases.
  • Figure 4: Results evaluating performance under high-dimensional variations on the Hopper (HP, top row) and HalfCheetah (HC, bottom row) environments. All figures show the empirical distribution of returns on 1,000 testing systems. Figure (a) demonstrates the robustness of PPO. Figure (b) reports empirical test returns of $\text{W}\text{R}^{2}\text{L}$'s policy trained on only two parameter changes (e.g., friction and density) of the environment but tested on systems with all high-dimensional dynamical parameters modified. Figure (c) trains and tests $\text{W}\text{R}^{2}\text{L}$ with all high-dimensional parameters of the simulator altered. Our method remains robust even when high-dimensional variations are considered.
  • Figure 5: Robustness results on the inverted pendulum demonstrating that our method outperforms the state of the art in terms of average test returns and that DDPG lacks robustness.

Theorems & Definitions (4)

  • Proposition 1: Zero-Order Gradient Estimate
  • proof
  • Proposition 2: Zero-Order Hessian Estimate
  • proof
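
To make the idea behind the zero-order estimates in Propositions 1 and 2 concrete, the sketch below estimates a gradient and a Hessian from function evaluations alone. It is a generic Gaussian-smoothing estimator with antithetic sampling, chosen here for illustration; it is not claimed to be the exact estimator stated in the propositions. The function `f` stands in for a black-box evaluation such as the return of a simulator at dynamics parameters $\phi$; the names `zero_order_estimates`, `sigma`, and `n_samples` are hypothetical.

```python
import numpy as np


def zero_order_estimates(f, phi, sigma=0.1, n_samples=10_000, seed=0):
    """Estimate the gradient and Hessian of a black-box scalar function f at phi.

    Uses perturbations phi +/- sigma * u with u ~ N(0, I). Only evaluations of
    f are needed, so f may be a simulator with no accessible derivatives.
    """
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    d = phi.size

    u = rng.standard_normal((n_samples, d))
    f0 = f(phi)
    f_plus = np.array([f(phi + sigma * ui) for ui in u])
    f_minus = np.array([f(phi - sigma * ui) for ui in u])

    # Gradient: average of u * central difference along the direction u.
    grad = (u * (f_plus - f_minus)[:, None]).mean(axis=0) / (2.0 * sigma)

    # Hessian: average of (u u^T - I) * second difference along the direction u.
    second_diff = f_plus + f_minus - 2.0 * f0
    outer = np.einsum("ni,nj->nij", u, u) - np.eye(d)
    hess = (outer * second_diff[:, None, None]).mean(axis=0) / (2.0 * sigma**2)
    return grad, hess
```

Both estimates are Monte Carlo averages, so their accuracy improves with `n_samples`, and each sample costs two evaluations of `f`. For non-quadratic functions the estimates target the derivatives of a Gaussian-smoothed version of `f`, so `sigma` trades bias against noise.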