Adversarial Diffusion for Robust Reinforcement Learning

Daniele Foffano, Alessio Russo, Alexandre Proutiere

TL;DR

This work tackles robustness in reinforcement learning by integrating diffusion-based trajectory modeling with CVaR-based adversarial risk optimization. It introduces Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL), where adversarially guided diffusion samples worst-case trajectories to train policies within a Dyna-style loop, increasing resilience to modeling errors and environmental uncertainty. The approach yields stronger robustness than state-of-the-art baselines on MuJoCo tasks while maintaining competitive nominal performance, though at higher computational cost due to diffusion steps. This framework advances practical robust RL by combining trajectory-level generation with tail-risk optimization, offering a scalable path toward safer, more reliable policies in uncertain dynamics.

Abstract

Robustness to modeling errors and uncertainties remains a central challenge in reinforcement learning (RL). In this work, we address this challenge by leveraging diffusion models to train robust RL policies. Diffusion models have recently gained popularity in model-based RL due to their ability to generate full trajectories "all at once", mitigating the compounding errors typical of step-by-step transition models. Moreover, they can be conditioned to sample from specific distributions, making them highly flexible. We leverage conditional sampling to learn policies that are robust to uncertainty in environment dynamics. Building on the established connection between Conditional Value at Risk (CVaR) optimization and robust RL, we introduce Adversarial Diffusion for Robust Reinforcement Learning (AD-RRL). AD-RRL guides the diffusion process to generate worst-case trajectories during training, effectively optimizing the CVaR of the cumulative return. Empirical results across standard benchmarks show that AD-RRL achieves superior robustness and performance compared to existing robust RL methods.
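To make the CVaR objective mentioned above concrete, the following minimal sketch estimates the CVaR of a batch of cumulative returns as the mean of the worst $\alpha$-fraction of samples. The function name `empirical_cvar`, the lower-tail convention, and the use of NumPy are illustrative assumptions; the abstract does not specify how the paper estimates this quantity.

```python
import numpy as np


def empirical_cvar(returns, alpha):
    """Lower-tail CVaR estimate: mean of the worst ceil(alpha * n) returns.

    `returns` is a 1-D collection of cumulative returns and `alpha` is the
    tail fraction in (0, 1]. Both names are hypothetical, chosen for this sketch.
    """
    returns = np.asarray(returns, dtype=float)
    if returns.ndim != 1 or returns.size == 0:
        raise ValueError("returns must be a non-empty 1-D array")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Number of samples that fall in the worst alpha-fraction (at least one).
    k = int(np.ceil(alpha * returns.size))
    # Sort ascending so the lowest (worst) returns come first.
    worst = np.sort(returns)[:k]
    return float(worst.mean())
```

With `alpha = 1.0` the estimate reduces to the ordinary sample mean, so smaller values of `alpha` place progressively more weight on the worst outcomes.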

Paper Structure

This paper contains 37 sections, 3 theorems, 46 equations, 7 figures, 5 tables, 3 algorithms.

Key Result

Lemma 4.1

Assume that the denoising process is Gaussian, that is, (eq:gausdiff) holds. Assume that for all $i\in [N]$, the approximation $p_{\boldsymbol{\theta}}(\boldsymbol{\tau}_i \in C_\alpha \mid \boldsymbol{\tau}_i) = \exp\left(-c_i\sum_{t = 1}^H \gamma^t r_t^{(i)}\right)$ holds. Then, we can sample trajectories from [...] where $\boldsymbol{g}_i = \nabla_{\boldsymbol{\tau}} Z(\mu_{\boldsymbol{\theta}}(\boldsymbol{\tau}_i$ [...]
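The lemma statement is truncated here, so its exact sampling distribution cannot be read off. As a rough illustration only, the sketch below applies the standard gradient-guidance rule for a Gaussian denoising step: when the conditioning probability is $\exp(-c\,Z(\boldsymbol{\tau}))$, the mean of $\mathcal{N}(\mu, \sigma^2 I)$ is shifted by $-c\,\sigma^2 \nabla Z(\mu)$. Everything specific is an assumption made for this example: $Z$ is taken to be the discounted return, the toy trajectory is a vector whose entry $t$ is the reward $r_t$, the covariance is isotropic, and the names `guided_mean` and `adversarial_guided_step` are hypothetical.

```python
import numpy as np


def discounted_return_grad(horizon, gamma):
    """Gradient of Z(tau) = sum_{t=1}^H gamma^t * tau_t with respect to tau.

    Toy assumption: entry t of tau is the reward r_t, so the gradient is
    the vector of discount factors (gamma^1, ..., gamma^H).
    """
    return gamma ** np.arange(1, horizon + 1)


def guided_mean(mu, sigma, c, gamma):
    """Shift the denoising mean toward lower discounted return.

    Standard Gaussian guidance with log p = -c * Z(tau) and covariance
    sigma^2 * I gives the shifted mean mu - c * sigma^2 * grad Z(mu).
    """
    mu = np.asarray(mu, dtype=float)
    g = discounted_return_grad(mu.shape[0], gamma)
    return mu - c * sigma**2 * g


def adversarial_guided_step(mu, sigma, c, gamma, rng):
    """Draw one sample from N(guided_mean, sigma^2 * I)."""
    mean = guided_mean(mu, sigma, c, gamma)
    return mean + sigma * rng.standard_normal(mean.shape)
```

A larger `c` pushes the sampled trajectory further toward low-return regions; setting `c = 0` recovers the unguided denoising step.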

Figures (7)

  • Figure 1: A high-level overview of AD-RRL. Following a Dyna-like structure (Sutton, 1991), the algorithm iteratively (i) samples trajectories from the real environment with a given policy $\pi$, (ii) improves the diffusion models on the collected data, and (iii) uses an Adversarially Guided Diffusion model (see the section on adversarially guided diffusion) to generate challenging synthetic trajectories, which in turn are used to improve $\pi$. The loop is repeated until convergence to the optimal policy $\pi^\star$.
  • Figure 2: Average return across variations in selected physics parameters. Shaded regions indicate $\pm$ one standard error.
  • Figure 3: Training-return curves on the nominal environment for five MuJoCo tasks. Shaded areas represent one standard error over 5 runs.
  • Figure 4: Average Return for varying physical parameters. Shaded areas represent one standard error over 5 runs.
  • Figure 5: Training-return curves on the nominal environment for the model-free baselines trained on $3$M samples. Shaded areas represent one standard error over 5 runs. The dashed line represents the reference final cumulative reward achieved by AD-RRL trained on $1.5$M samples.
  • ...and 2 more figures
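The Dyna-style loop described in the Figure 1 caption can be sketched as plain control flow. This is a skeleton under stated assumptions, not the paper's algorithm: the four callables (`collect_real`, `fit_diffusion`, `sample_adversarial`, `improve_policy`) are hypothetical placeholders for the stages named in the caption, and a fixed iteration count stands in for the "until convergence" criterion.

```python
def dyna_style_loop(policy, collect_real, fit_diffusion, sample_adversarial,
                    improve_policy, num_iterations):
    """Alternate real data collection, model fitting, and policy improvement.

    All callables are hypothetical stand-ins supplied by the caller:
      collect_real(policy)             -> list of real trajectories
      fit_diffusion(model, dataset)    -> updated diffusion model
      sample_adversarial(model, policy)-> list of synthetic trajectories
      improve_policy(policy, synthetic)-> updated policy
    """
    dataset = []   # all real trajectories collected so far
    model = None   # the diffusion model, created on the first fit
    for _ in range(num_iterations):
        # 1. Sample trajectories from the real environment with the current policy.
        dataset.extend(collect_real(policy))
        # 2. Improve the diffusion model on the collected data.
        model = fit_diffusion(model, dataset)
        # 3. Generate challenging synthetic trajectories with the guided model.
        synthetic = sample_adversarial(model, policy)
        # 4. Improve the policy on the synthetic trajectories.
        policy = improve_policy(policy, synthetic)
    return policy, model, dataset
```

Passing the stages in as callables keeps the skeleton independent of any particular diffusion library or policy optimizer.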

Theorems & Definitions (8)

  • Lemma 4.1
  • Lemma 4.2
  • Proposition 4.3
  • Remark 4.4
  • Proof of Lemma (lem:perturbed_f)
  • Proof of Eq. (eq:cond_sampling)
  • Proof of Lemma (lem:multiplicative_noise)
  • Proof of Proposition (prop:ci_constraint)