Distributionally Robust Self Paced Curriculum Reinforcement Learning
Anirudh Satheesh, Keenan Powell, Vaneet Aggarwal
TL;DR
DR-SPCRL presents a continuous, self-paced curriculum for robustness in reinforcement learning by treating the uncertainty budget $\epsilon$ as a learnable context and using the dual variable $\beta^*$ to regulate progression. By applying the Envelope Theorem, the authors show that the gradient of the robust value with respect to $\epsilon$ equals $-\beta^*$, enabling an explicit update: $\epsilon_{t+1}=\epsilon_t-\lambda_{\mathrm{curr}}\left(\beta^*(\epsilon_t)+2\alpha(\epsilon_t-\epsilon_{\mathrm{budget}})\right)$, with $\beta^*$ approximated by a neural network $\beta_\phi$. The method is RL-algorithm-agnostic and demonstrates improved robustness–performance trade-offs across multiple continuous-control environments and base algorithms (PPO, SAC, DDPG): on average, $11.8\%$ higher episodic return under perturbations than fixed or heuristic $\epsilon$ schedules, and roughly $1.9\times$ the performance of the corresponding nominal RL algorithms. The adaptive curriculum also stabilizes training, which suggests practical value for robotics and other systems that face sim-to-real gaps or distribution shifts at deployment.
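The update rule for $\epsilon$ can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`curriculum_step`, `run_curriculum`), the linear stand-in `toy_beta` for the learned dual network $\beta_\phi$, the clipping of $\epsilon$ to $[0, \epsilon_{\mathrm{budget}}]$, and all numeric values are assumptions made here for illustration only.

```python
from typing import Callable, List


def curriculum_step(
    eps: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr_curr: float,
    alpha: float,
) -> float:
    """One step of eps_{t+1} = eps_t - lr_curr * (beta(eps_t) + 2*alpha*(eps_t - eps_budget))."""
    # beta_fn stands in for the dual estimate beta_phi(eps_t).
    grad = beta_fn(eps) + 2.0 * alpha * (eps - eps_budget)
    new_eps = eps - lr_curr * grad
    # Assumption (not stated in the source): keep eps inside [0, eps_budget].
    return min(max(new_eps, 0.0), eps_budget)


def run_curriculum(
    eps0: float,
    beta_fn: Callable[[float], float],
    eps_budget: float,
    lr_curr: float,
    alpha: float,
    steps: int,
) -> List[float]:
    """Iterate the update and return the full epsilon schedule, including eps0."""
    schedule = [eps0]
    for _ in range(steps):
        schedule.append(
            curriculum_step(schedule[-1], beta_fn, eps_budget, lr_curr, alpha)
        )
    return schedule


def toy_beta(eps: float) -> float:
    """Hypothetical dual variable that grows linearly with eps (illustration only)."""
    return 0.5 * eps


if __name__ == "__main__":
    schedule = run_curriculum(
        eps0=0.0, beta_fn=toy_beta, eps_budget=0.5, lr_curr=0.1, alpha=1.0, steps=50
    )
    print([round(e, 4) for e in schedule[:5]], "...", round(schedule[-1], 4))
```

In this toy setting, the term $2\alpha(\epsilon_t-\epsilon_{\mathrm{budget}})$ pulls $\epsilon$ toward the target budget while the dual term pushes back, so the schedule settles where $\beta(\epsilon)=2\alpha(\epsilon_{\mathrm{budget}}-\epsilon)$. With the assumed linear `toy_beta` that point is $\epsilon=0.4$, below the budget of $0.5$. In the method itself, $\beta_\phi$ changes as the agent trains, so the curriculum can keep advancing as the agent improves.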
Abstract
A central challenge in reinforcement learning is that policies trained in controlled environments often fail under distribution shift when deployed in real-world environments. Distributionally Robust Reinforcement Learning (DRRL) addresses this by optimizing for worst-case performance within an uncertainty set defined by a robustness budget $\epsilon$. However, fixing $\epsilon$ forces a trade-off between performance and robustness: small values yield high nominal performance but weak robustness, while large values can cause instability and overly conservative policies. We propose Distributionally Robust Self-Paced Curriculum Reinforcement Learning (DR-SPCRL), a method that overcomes this limitation by treating $\epsilon$ as a continuous curriculum. DR-SPCRL adaptively schedules the robustness budget according to the agent's progress, enabling a balance between nominal and robust performance. Empirical results across multiple environments demonstrate that DR-SPCRL not only stabilizes training but also achieves a superior robustness-performance trade-off, yielding an average $11.8\%$ increase in episodic return under varying perturbations compared to fixed or heuristic scheduling strategies, and achieving approximately $1.9\times$ the performance of the corresponding nominal RL algorithms.
