GRAM: Generalization in Deep RL with a Robust Adaptation Module

James Queeney, Xiaoyi Cai, Alexander Schperberg, Radu Corcodel, Mouhacine Benosman, Jonathan P. How

TL;DR

GRAM addresses the challenge of generalizing deep RL policies to both in-distribution (ID) and unseen out-of-distribution (OOD) dynamics. It unifies ID adaptation and OOD robustness through a robust adaptation module based on an epistemic neural network, together with a joint training pipeline that combines teacher-student adaptation with adversarial RL. The key contributions are the robust adaptation module with a GRAM posterior mechanism and a training scheme that jointly optimizes ID and OOD behavior, validated on simulated and real quadruped locomotion. The results show ID performance comparable to contextual/domain randomization methods together with robust OOD behavior, enabling effective sim-to-real transfer in diverse environments.

Abstract

The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for dynamics generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environment dynamics, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate through extensive simulation and hardware locomotion experiments on a quadruped robot.

Paper Structure

This paper contains 21 sections, 9 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: GRAM generalizes to both ID and OOD environment dynamics at deployment time with a single unified architecture. GRAM introduces a robust adaptation module that quantifies uncertainty about the deployment environment using a recent history of observations, and biases latent context estimates towards a special robust latent feature ($\star$) when uncertainty is high.
  • Figure 2: Robust adaptation module used by GRAM at deployment time. Top: Epistemic neural network $\phi$ outputs a sample mean and variance of latent feature estimates for a history $h_t$, which are used to calculate $\Phi_{\textnormal{GRAM}}$ in Eq. \ref{eq:ra_module}. Bottom: In ID contexts, variance of latent feature estimates will be low and $\Phi_{\textnormal{GRAM}}$ will be close to the mean estimate. In OOD contexts with different environment dynamics, variance will be high and $\Phi_{\textnormal{GRAM}}$ will output an estimate close to $z_{\textnormal{rob}}$.
  • Figure 3: Joint RL training pipeline used by GRAM, which combines standard ID data collection and adversarial data collection for every RL update. Training environments are assigned to adaptive training or robust training at each iteration, and assignments alternate between iterations. RL training is followed by supervised learning to train the adaptation network $\phi(h_t, \xi)$.
  • Figure 4: GRAM average $\alpha_t$ (top) and 25%-CVaR $\alpha_t$ (bottom). Left: Deployment environments from Table \ref{tab:sim_results} with default ID context set for training. Middle: Range of added base mass deployment scenarios (default vs. wide ID context sets for training). Right: 9 kg base mass added 10 seconds into deployment (default vs. wide ID context sets for training).
  • Figure 5: Average normalized task returns for 9 kg added base mass scenario. Left: Training with default ID context set from Table \ref{tab:base_id}. Right: Training across wide added base mass range of $[-1.00, 9.00]$.
  • ...and 3 more figures
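
The behavior described in the Figure 2 caption (output close to the mean latent estimate when variance is low, close to $z_{\textnormal{rob}}$ when variance is high) can be illustrated with a minimal sketch. This is not the paper's definition of $\Phi_{\textnormal{GRAM}}$, which is not reproduced on this page. The exponential weighting, the `beta` parameter, the averaging of variance across latent dimensions, and the function name `gram_latent` are all assumptions chosen only to show the qualitative effect.

```python
import numpy as np


def gram_latent(z_samples, z_rob, beta=1.0):
    """Blend the mean latent estimate with a robust latent feature.

    z_samples: (n_samples, latent_dim) latent estimates for one history h_t,
               e.g. from several draws of an epistemic network.
    z_rob:     (latent_dim,) robust latent feature.
    beta:      hypothetical sensitivity parameter (an assumption, not from the paper).
    """
    z_samples = np.asarray(z_samples, dtype=float)
    z_rob = np.asarray(z_rob, dtype=float)

    z_mean = z_samples.mean(axis=0)
    # Scalar uncertainty: per-dimension sample variance, averaged over dimensions.
    variance = z_samples.var(axis=0).mean()

    # Assumed weighting: alpha is 1 at zero variance and decays toward 0 as variance grows.
    alpha = float(np.exp(-beta * variance))

    # Low variance -> close to the mean estimate; high variance -> close to z_rob.
    z_gram = alpha * z_mean + (1.0 - alpha) * z_rob
    return z_gram, alpha


if __name__ == "__main__":
    z_rob = np.array([5.0, 5.0])
    id_samples = np.tile([1.0, 2.0], (5, 1))             # samples agree: ID-like
    ood_samples = np.array([[-10.0, -10.0], [10.0, 10.0]])  # samples disagree: OOD-like
    print(gram_latent(id_samples, z_rob))
    print(gram_latent(ood_samples, z_rob))
```

In this sketch the weight `alpha` plays a role loosely analogous to the $\alpha_t$ reported in Figure 4, although the exact relationship to the paper's quantity is assumed rather than confirmed.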