Distributionally Robust Self Paced Curriculum Reinforcement Learning

Anirudh Satheesh, Keenan Powell, Vaneet Aggarwal

TL;DR

DR-SPCRL presents a continuous, self-paced curriculum for robustness in reinforcement learning by treating the uncertainty budget $\epsilon$ as a learnable context and leveraging the dual variable $\beta^*$ to regulate progression. By applying the Envelope Theorem, the authors show that the gradient of the robust value with respect to $\epsilon$ equals $-\beta^*$, enabling an explicit update: $\epsilon_{t+1}=\epsilon_t-\lambda_{curr}(\beta^*(\epsilon_t)+2\alpha(\epsilon_t-\epsilon_{budget}))$, with $\beta^*$ approximated by a neural network $\beta_\phi$. The method is RL-algorithm-agnostic and demonstrates improved robustness–performance trade-offs (averaging $11.8\%$ higher returns under perturbations and roughly $1.9\times$ the nominal RL performance) across multiple continuous-control environments and base algorithms (PPO, SAC, DDPG). This adaptive curriculum stabilizes training and enhances deployment reliability in the face of sim-to-real gaps and other environmental uncertainties, suggesting strong practical impact for real-world robotics and other systems subject to distribution shifts.
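The update rule quoted above can be read as a gradient step: since the gradient of the robust value with respect to $\epsilon$ is $-\beta^*$, the term $\beta^*(\epsilon_t)+2\alpha(\epsilon_t-\epsilon_{budget})$ is the derivative of the negated robust value plus a quadratic penalty $\alpha(\epsilon-\epsilon_{budget})^2$. The following is a minimal sketch of that single scalar update, not the authors' implementation. The function names, the hyperparameter values, and in particular `toy_beta` (a hand-written stand-in for the learned network $\beta_\phi$) are assumptions made purely for illustration.

```python
from typing import Callable, List


def curriculum_step(
    eps: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr: float,
    alpha: float,
) -> float:
    """Apply eps <- eps - lr * (beta(eps) + 2 * alpha * (eps - eps_budget))."""
    grad = beta_fn(eps) + 2.0 * alpha * (eps - eps_budget)
    return eps - lr * grad


def toy_beta(eps: float) -> float:
    # Hypothetical stand-in for the learned dual estimate beta_phi.
    # The shape of this function is an assumption, not taken from the paper.
    return 0.1 / (1.0 + eps)


def run_curriculum(
    eps0: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr: float,
    alpha: float,
    steps: int,
) -> List[float]:
    """Iterate the update and return the full trajectory of eps values."""
    trajectory = [eps0]
    for _ in range(steps):
        trajectory.append(
            curriculum_step(trajectory[-1], beta_fn, eps_budget, lr, alpha)
        )
    return trajectory


if __name__ == "__main__":
    # Illustrative hyperparameters only; the source does not specify values.
    traj = run_curriculum(
        eps0=0.0, beta_fn=toy_beta, eps_budget=0.5, lr=0.1, alpha=1.0, steps=200
    )
    print(f"final eps: {traj[-1]:.4f}")
```

In this toy setting the budget grows from zero and settles where the dual term balances the penalty, $\beta(\epsilon)=2\alpha(\epsilon_{budget}-\epsilon)$. For a positive dual estimate, that point lies slightly below $\epsilon_{budget}$. In the method itself, $\beta_\phi$ would be trained alongside the policy rather than fixed in advance.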

Abstract

A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shift when deployed in real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $\epsilon$. However, fixing $\epsilon$ forces a trade-off between performance and robustness: small values yield high nominal performance but weak robustness, while large values can cause instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $\epsilon$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent's progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average $11.8\%$ increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately $1.9\times$ the performance of the corresponding nominal RL algorithms.

Paper Structure

This paper contains 23 sections, 22 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Robustness evaluation of DRPPO under action, observation, and environment perturbations for the Hopper environment. Each panel shows episodic returns as a function of noise levels. Policies trained with smaller fixed robustness budgets ($\epsilon$) fail to handle perturbations effectively, while larger $\epsilon$ values produce overly conservative behavior, leading to suboptimal policies. This highlights the tradeoff between robustness and nominal performance for fixed $\epsilon$ settings.
  • Figure 2: PPO Robustness under observation, action and environmental perturbations.
  • Figure 3: SAC Robustness under observation, action and environmental perturbations.
  • Figure 4: DDPG Robustness under observation, action and environmental perturbations.
  • Figure 5: Comparison of training curves across the four continuous-control environments under different curriculum strategies. DR-SPCRL demonstrates smoother and more stable learning dynamics compared to the non-robust baselines, often significantly increasing the nominal reward.
  • ...and 1 more figure