Distributionally Robust Self Paced Curriculum Reinforcement Learning

Anirudh Satheesh; Keenan Powell; Vaneet Aggarwal

TL;DR

DR-SPCRL presents a continuous, self-paced curriculum for robustness in reinforcement learning by treating the uncertainty budget $\epsilon$ as a learnable context and leveraging the dual variable $\beta^*$ to regulate progression. By applying the Envelope Theorem, the authors show that the gradient of the robust value with respect to $\epsilon$ equals $-\beta^*$, enabling an explicit update: $\epsilon_{t+1}=\epsilon_t-\lambda_{curr}(\beta^*(\epsilon_t)+2\alpha(\epsilon_t-\epsilon_{budget}))$, with $\beta^*$ approximated by a neural network $\beta_\phi$. The method is RL-algorithm-agnostic and demonstrates improved robustness–performance trade-offs (averaging $11.8\%$ higher returns under perturbations and roughly $1.9\times$ the nominal RL performance) across multiple continuous-control environments and base algorithms (PPO, SAC, DDPG). This adaptive curriculum stabilizes training and enhances deployment reliability in the face of sim-to-real gaps and other environmental uncertainties, suggesting strong practical impact for real-world robotics and other systems subject to distribution shifts.
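The update rule quoted above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `update_epsilon`, `run_curriculum`, and `toy_beta` are hypothetical names, `toy_beta` is a constant placeholder for the learned network $\beta_\phi$, and the clamp to non-negative values is an added safeguard that the summary does not mention.

```python
from typing import Callable, List


def update_epsilon(
    eps: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr_curr: float,
    alpha: float,
) -> float:
    """One step of eps <- eps - lr_curr * (beta(eps) + 2*alpha*(eps - eps_budget))."""
    step_direction = beta_fn(eps) + 2.0 * alpha * (eps - eps_budget)
    new_eps = eps - lr_curr * step_direction
    # Assumption (not from the source): keep the robustness budget non-negative.
    return max(new_eps, 0.0)


def toy_beta(eps: float) -> float:
    """Hypothetical stand-in for beta_phi; the source uses a neural network here."""
    return 0.1


def run_curriculum(
    eps0: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr_curr: float,
    alpha: float,
    steps: int,
) -> List[float]:
    """Iterate the update and return the whole epsilon trajectory."""
    trajectory = [eps0]
    for _ in range(steps):
        trajectory.append(
            update_epsilon(trajectory[-1], beta_fn, eps_budget, lr_curr, alpha)
        )
    return trajectory


if __name__ == "__main__":
    traj = run_curriculum(
        eps0=0.0, beta_fn=toy_beta, eps_budget=0.5, lr_curr=0.1, alpha=1.0, steps=200
    )
    print(f"final epsilon: {traj[-1]:.4f}")
```

With a constant dual value $\beta$, the iteration is a contraction whenever $0 < 2\alpha\lambda_{curr} < 2$ and settles at $\epsilon_{budget} - \beta/(2\alpha)$. In this toy setting a larger dual value therefore holds the curriculum further below the target budget, while a dual value near zero lets $\epsilon$ approach $\epsilon_{budget}$.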

Abstract

A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shifts when deployed in real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $\epsilon$. However, fixing $\epsilon$ forces a trade-off between performance and robustness: small values yield high nominal performance but weak robustness, while large values can cause instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $\epsilon$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent's progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average $11.8\%$ increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately $1.9\times$ the performance of the corresponding nominal RL algorithms.
