Continual Domain Randomization

Josip Josifovski; Sayantan Auddy; Mohammadhossein Malmir; Justus Piater; Alois Knoll; Nicolás Navarro-Guerrero

TL;DR

The paper tackles the reality gap in sim2real robotic RL by addressing a drawback of standard domain randomization: randomizing a fixed set of simulator parameters all at once makes the task harder and can yield sub-optimal policies. It introduces Continual Domain Randomization (CDR), which trains a single policy sequentially on one subset of randomization parameters at a time, starting from a non-randomized simulation, and uses continual learning (PPO with Elastic Weight Consolidation, EWC, or online EWC) to prevent forgetting of earlier randomizations. Empirical results on reacher and grasper tasks show that CDR matches or outperforms baselines that randomize all parameters at once or use sequential finetuning, with improved stability and robustness to the order of randomizations. The approach supports zero-shot transfer, reduces memory and complexity compared with alternative continual-sim2real methods, and can be extended with automated or active domain randomization to tune parameter ranges.
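As a rough guide to what the EWC and online EWC regularizers look like, the standard formulations from the continual learning literature can be written as below. This is a sketch under assumed notation, not necessarily the paper's: $\theta$ are the policy parameters, $\mathcal{L}^{\mathrm{PPO}}_k$ is the PPO loss on the $k$-th randomization, $\theta^{*}_{i}$ and $F_{i}$ are the parameter snapshot and diagonal Fisher information saved after randomization $i$, $\lambda$ is the regularization strength, and $\gamma$ is a decay factor.

$$
\mathcal{L}_k(\theta) = \mathcal{L}^{\mathrm{PPO}}_k(\theta) + \frac{\lambda}{2} \sum_{i<k} \sum_{j} F_{i,j}\,\bigl(\theta_j - \theta^{*}_{i,j}\bigr)^2
\qquad \text{(EWC: one snapshot and Fisher matrix per past randomization)}
$$

$$
\mathcal{L}_k(\theta) = \mathcal{L}^{\mathrm{PPO}}_k(\theta) + \frac{\lambda}{2} \sum_{j} \tilde{F}_{k-1,j}\,\bigl(\theta_j - \theta^{*}_{k-1,j}\bigr)^2,
\qquad \tilde{F}_{k} = \gamma\,\tilde{F}_{k-1} + F_{k}
\qquad \text{(online EWC: a single snapshot and running Fisher matrix)}
$$

The first form grows in memory with the number of randomizations, while the second keeps storage constant.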

Abstract

Domain Randomization (DR) is commonly used for sim2real transfer of reinforcement learning (RL) policies in robotics. Most DR approaches require a simulator with a fixed set of tunable parameters from the start of the training, from which the parameters are randomized simultaneously to train a robust model for use in the real world. However, the combined randomization of many parameters increases the task difficulty and might result in sub-optimal policies. To address this problem and to provide a more flexible training process, we propose Continual Domain Randomization (CDR) for RL that combines domain randomization with continual learning to enable sequential training in simulation on a subset of randomization parameters at a time. Starting from a model trained in a non-randomized simulation where the task is easier to solve, the model is trained on a sequence of randomizations, and continual learning is employed to remember the effects of previous randomizations. Our experiments on robotic reaching and grasping tasks show that the model trained in this fashion learns effectively in simulation and performs robustly on the real robot while matching or outperforming baselines that employ combined randomization or sequential randomization without continual learning. Our code and videos are available at https://continual-dr.github.io/.
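To make the idea of "remembering the effects of previous randomizations" concrete, the following is a minimal NumPy sketch of the bookkeeping behind a standard EWC penalty and an online-EWC Fisher update. It is not the authors' implementation: the function names `ewc_penalty` and `online_fisher`, the parameters `lam` and `gamma`, and the toy numbers are all assumptions chosen for illustration.

```python
import numpy as np


def ewc_penalty(theta, snapshots, fishers, lam):
    """Quadratic EWC penalty summed over all stored (snapshot, Fisher) pairs.

    theta:     current flat parameter vector
    snapshots: list of parameter vectors saved after each past randomization
    fishers:   list of diagonal Fisher estimates, one per snapshot
    lam:       regularization strength (hypothetical name)
    """
    penalty = 0.0
    for theta_star, fisher in zip(snapshots, fishers):
        penalty += 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))
    return penalty


def online_fisher(running, new, gamma=1.0):
    """Online-EWC style update: keep one running Fisher instead of a list."""
    return new if running is None else gamma * running + new


# Toy example with two parameters and one past randomization.
theta = np.array([1.0, 2.0])
snapshots = [np.array([0.0, 0.0])]
fishers = [np.array([1.0, 0.5])]

penalty = ewc_penalty(theta, snapshots, fishers, lam=2.0)

# Merge the Fisher estimate of a second randomization into the running one.
merged = online_fisher(fishers[0], np.array([1.0, 1.0]), gamma=0.5)
```

In a training loop, a penalty of this kind would be added to the RL loss while training on the next randomization. The list-based variant stores one snapshot and Fisher estimate per past randomization, whereas the online variant passes a single merged Fisher estimate and the latest snapshot.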

Paper Structure (17 sections, 10 equations, 6 figures, 4 tables)

This paper contains 17 sections, 10 equations, 6 figures, 4 tables.

Introduction
Related Work
Methodology
Problem Description
Continual Domain Randomization
Baselines
Experiment Setup
Reacher Task
Grasper Task
Robot Platform
Training Procedure
Evaluation Procedure and Metrics
Experiment Results
Reacher
Grasper
...and 2 more sections

Figures (6)

Figure 1: Overview of our proposed CDR approach. CDR-$\lambda$ (blue arrow) uses a set of network snapshots and Fisher matrices (one for each past task) for continual learning, while CDR-$\mathcal{O}\lambda$ (green arrow) uses only a single parameter snapshot and Fisher matrix. Other baselines are shown with gray arrows. Each shape ($\square,\Circle, \cdots, \triangle$) represents a unique randomization parameter set. Disabled and enabled sets are indicated with gray and blue respectively.
Figure 2: The simulated and real environments for reaching and grasping.
Figure 3: Effects of different randomization parameters on sim2real transfer for the reaching task.
Figure 4: Training progress for the reaching task. Finetuning, CDR-$\lambda$ and CDR-$\mathcal{O}\lambda$ start from the Ideal model at $10^6$ timesteps, and are sequentially trained on each randomization for $10^6$ steps, shown by vertical dotted lines. The max reward is the best reward achieved by any agent, and the min reward is the one achieved by an agent that executes random actions.
Figure 5: Training progress for the grasping task. Finetuning, CDR-$\lambda$ and CDR-$\mathcal{O}\lambda$ start from the Ideal model at $4\times10^6$ timesteps, and are sequentially trained on each randomization for $2\times10^6$ steps (vertical dotted lines). The max reward is the best reward achieved by any agent, and the min reward is the one achieved by an agent that executes random actions.
...and 1 more figure
