Wasserstein Robust Reinforcement Learning
Mohammed Amin Abdullah, Hang Ren, Haitham Bou Ammar, Vladimir Milenkovic, Rui Luo, Mingtian Zhang, Jun Wang
TL;DR
This work tackles the challenge of overfitting and poor generalisation in reinforcement learning by introducing WR$^2$L, a Wasserstein-robust RL framework that seeks the best policy under worst-case but bounded transition dynamics. The core idea is to constrain admissible dynamics within an $\epsilon$-Wasserstein ball around a reference model $\mathcal{P}_0$ and to solve a min–max objective over policy parameters $\theta$ and dynamics parameters $\phi$. A key novelty is the alternating descent-ascent optimisation, with a second-order Taylor approximation of the Wasserstein distance yielding a closed-form update for $\phi$ and a gradient-based update for $\theta$, plus a zero-order method to estimate gradients and Hessians when dynamics are treated as black-box simulators. Empirically, WR$^2$L demonstrates superior robustness compared to standard and prior robust RL approaches on MuJoCo benchmarks, including high-dimensional variations, highlighting its practical impact for real-world, uncertain environments. The work also provides a scalable zero-order solver and analytic results for the Hessian-based constraint handling, offering a general tool for robust optimisation in dynamical systems.
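As a rough sketch of the closed-form dynamics update described above, the constrained problem can be written out as follows. The symbols $J$ (expected return), $W$ (the Wasserstein-based distance between dynamics models), $\phi_0$ (the parameters of the reference model $\mathcal{P}_0$), $g$ and $H_0$ are notation assumed here for illustration and may differ from the paper's. The robust objective is

$$\max_{\theta}\ \min_{\phi}\ J(\theta,\phi)\quad\text{subject to}\quad W(\mathcal{P}_{\phi},\mathcal{P}_0)\le\epsilon .$$

Since the distance is zero and minimal at $\phi=\phi_0$, the zeroth- and first-order terms of its Taylor expansion vanish, leaving a quadratic constraint with Hessian $H_0=\nabla^2_{\phi}W\big|_{\phi=\phi_0}$:

$$W(\mathcal{P}_{\phi},\mathcal{P}_0)\approx\tfrac{1}{2}(\phi-\phi_0)^{\top}H_0(\phi-\phi_0)\le\epsilon .$$

If $J$ is additionally linearised in $\phi$ with gradient $g=\nabla_{\phi}J(\theta,\phi)$, and $H_0$ is assumed positive definite, minimising $g^{\top}(\phi-\phi_0)$ under this constraint gives, via the Lagrangian stationarity condition,

$$\phi^{\star}=\phi_0-\sqrt{\frac{2\epsilon}{g^{\top}H_0^{-1}g}}\;H_0^{-1}g ,$$

which lies on the boundary of the approximate $\epsilon$-ball. The policy parameters $\theta$ are then updated by an ordinary gradient step on $J(\theta,\phi^{\star})$.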
Abstract
Reinforcement learning algorithms, though successful, tend to over-fit to their training environments, which hampers their application to the real world. This paper proposes $\text{W}\text{R}^{2}\text{L}$, a robust reinforcement learning algorithm that achieves significant robustness on both low- and high-dimensional control tasks. Our method formalises robust reinforcement learning as a novel min-max game with a Wasserstein constraint, designed to yield a correct and convergent solver. Apart from the formulation, we also propose an efficient and scalable solver based on a novel zero-order optimisation method that we believe can be useful for numerical optimisation in general. We empirically demonstrate significant gains compared to standard and robust state-of-the-art algorithms on high-dimensional MuJoCo environments.
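To make the zero-order idea concrete, the following is a minimal sketch, not the paper's implementation. It estimates a gradient and Hessian of a black-box function from Gaussian perturbations with antithetic sampling, then takes a step to the boundary of a quadratic constraint. The function names, the particular estimator, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def zero_order_grad_hess(f, x, sigma=0.1, n_samples=50000, seed=0):
    """Estimate grad f(x) and Hessian of f at x using only evaluations of f.

    Assumed estimator: Gaussian smoothing with antithetic pairs x +/- sigma*xi.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    xi = rng.standard_normal((n_samples, d))
    f0 = f(x)
    f_plus = np.array([f(x + sigma * z) for z in xi])
    f_minus = np.array([f(x - sigma * z) for z in xi])
    # Gradient: average of xi * (central difference along xi).
    grad = (xi * (f_plus - f_minus)[:, None]).mean(axis=0) / (2.0 * sigma)
    # Hessian: average of (xi xi^T - I) * (second difference along xi).
    second_diff = f_plus + f_minus - 2.0 * f0
    outer = np.einsum("ni,nj->nij", xi, xi) - np.eye(d)
    hess = (outer * second_diff[:, None, None]).mean(axis=0) / (2.0 * sigma**2)
    return grad, hess


def worst_case_dynamics_step(phi0, grad_phi, hess0, eps):
    """Minimise grad_phi^T (phi - phi0) s.t. 0.5 (phi-phi0)^T hess0 (phi-phi0) <= eps.

    Assumes hess0 is symmetric positive definite.
    """
    direction = np.linalg.solve(hess0, grad_phi)
    scale = np.sqrt(2.0 * eps / (grad_phi @ direction))
    return phi0 - scale * direction


# Toy black-box objective (hypothetical): a quadratic with known derivatives.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([0.3, -0.2])
f = lambda v: 0.5 * v @ A @ v + b @ v

x = np.array([1.0, -1.0])
grad_est, hess_est = zero_order_grad_hess(f, x)
phi_new = worst_case_dynamics_step(x, A @ x + b, A, eps=0.05)
```

On this toy quadratic the estimates can be compared against the exact gradient `A @ x + b` and Hessian `A`. In a robust RL setting, `f` would instead be a simulator-based quantity and each evaluation would be far more expensive, so the sample count used here is purely illustrative.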
