Generalization in Reinforcement Learning by Soft Data Augmentation

Nicklas Hansen, Xiaolong Wang

TL;DR

This work tackles the generalization gap in vision-based reinforcement learning by decoupling data augmentation from policy optimization through SOft Data Augmentation (SODA). SODA uses a self-supervised latent-mapping objective to maximize information shared between augmented and non-augmented observations while the RL policy is trained only on non-augmented data, improving stability and sample efficiency. Empirical results on DMControl-GB and a robotic manipulation task show that SODA outperforms strong baselines in generalization to unseen environments and under strong visual perturbations, with notable gains in color and video-background scenarios. The approach provides a practical, architecture-agnostic method to leverage robust data augmentations via representation learning without destabilizing RL optimization, and the authors release DMControl-GB as an open benchmark.

Abstract

Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, optimization becomes increasingly challenging, and empirically may result in lower sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning. Specifically, SODA imposes a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data, while the RL optimization process uses strictly non-augmented data. Empirical evaluations are performed on diverse tasks from DeepMind Control suite as well as a robotic manipulation task, and we find SODA to significantly advance sample efficiency, generalization, and stability in training over state-of-the-art vision-based RL methods.
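As a rough illustration of the soft constraint described above, the following sketch computes a consistency loss between features of an augmented observation and target features of the non-augmented one, then updates the target network by an exponential moving average. The symbol names follow the Figure 2 caption ($f$, $g$, $h$, $z'$, $z^{\star}$). Everything else is an assumption made for illustration and is not stated in the text here: the linear stand-ins for the networks, the L2 normalisation, the squared-error form of the loss, the noise used as a placeholder augmentation, and the momentum value `tau`.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, FEAT_DIM = 16, 8  # hypothetical sizes for illustration


def init_params():
    # Linear stand-ins for encoder f, projection g and predictor h (assumption).
    return {
        "f": rng.normal(scale=0.1, size=(OBS_DIM, FEAT_DIM)),
        "g": rng.normal(scale=0.1, size=(FEAT_DIM, FEAT_DIM)),
        "h": rng.normal(scale=0.1, size=(FEAT_DIM, FEAT_DIM)),
    }


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def soda_loss(online, target, obs, obs_aug):
    # Online branch sees the augmented view: z' = g(f(o')), then predicts with h.
    z_pred = l2_normalize(obs_aug @ online["f"] @ online["g"] @ online["h"])
    # Target branch sees the non-augmented view: z* = g_psi(f_psi(o)).
    z_star = l2_normalize(obs @ target["f"] @ target["g"])
    # Consistency loss: mean squared distance between prediction and target.
    return float(np.mean(np.sum((z_pred - z_star) ** 2, axis=-1)))


def ema_update(online, target, tau=0.005):
    # Target weights slowly track the online weights (tau is an assumed value).
    for name in ("f", "g"):
        target[name] = (1.0 - tau) * target[name] + tau * online[name]
    return target


online = init_params()
target = {k: v.copy() for k, v in online.items() if k in ("f", "g")}
obs = rng.normal(size=(4, OBS_DIM))               # non-augmented batch
obs_aug = obs + 0.1 * rng.normal(size=obs.shape)  # placeholder augmentation
loss = soda_loss(online, target, obs, obs_aug)
target = ema_update(online, target)
```

In this sketch only the auxiliary loss ever receives `obs_aug`. An RL update would be computed separately from `obs` alone, which is the decoupling the abstract describes.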

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Generalization in RL. Agents are trained in a fixed environment (denoted the training environment) and we measure generalization to unseen environments with (i) random colors and (ii) video backgrounds. To simulate real-world deployment, we additionally randomize camera, lighting, and texture during evaluation in the robotic manipulation task. Additional samples are shown in the appendix.
  • Figure 2: SODA architecture. Left: an observation $o$ is augmented to produce a view $o'$, which is then encoded and projected into $z'=g_{\theta}(f_{\theta}(o'))$. Likewise, $o$ is encoded by $f_{\psi}$ and projected by $g_{\psi}$ to produce features $z^{\star}$. The SODA objective is then to predict $z^{\star}$ from $z'$ using $h_{\theta}$, formulated as a consistency loss. Right: Reinforcement Learning in SODA. The RL task remains unchanged and is trained directly on the non-augmented observations $o$. ema denotes an exponential moving average.
  • Figure 3: Data augmentation. We consider the following two data augmentations: random convolution (as proposed by Lee et al., 2019 and Laskin et al., 2020) and random overlay (novel). See the appendix for additional data augmentation samples.
  • Figure 4: Random convolution. Top: average return on the training environment during training. Bottom: periodic evaluation of generalization ability measured by average return on the random color environment. SODA exhibits sample efficiency and convergence similar to SAC but improves generalization significantly. Average of 5 runs, shaded area is std. deviation.
  • Figure 5: Soft data augmentation. Average return on the training environment for walker_walk and walker_stand tasks. Augment RL corresponds to the SAC (conv) baseline, Augment both applies random convolution in both SODA and RL, and Augment SODA is the proposed formulation of SODA. Average of 5 runs, shaded area is std. deviation.
  • ...and 4 more figures
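
Figure 3 names random convolution as one of the two augmentations. The caption does not spell out how it works, so the sketch below shows one common reading of the idea: each observation is filtered by a convolution kernel sampled fresh at every call. The kernel size, the Gaussian initialisation and its scale, the edge padding, and the sigmoid squashing are assumptions for illustration, and the function is a hypothetical helper rather than the authors' implementation.

```python
import numpy as np


def random_conv(image, rng, kernel_size=3):
    """Filter a (C, H, W) image in [0, 1] with a freshly sampled kernel.

    Assumes an odd kernel_size so the output keeps the input's spatial size.
    """
    c, h, w = image.shape
    pad = kernel_size // 2
    # One random kernel per (output channel, input channel) pair; the
    # Gaussian init and its scale are assumptions for illustration.
    scale = 1.0 / np.sqrt(c * kernel_size * kernel_size)
    weights = rng.normal(scale=scale, size=(c, c, kernel_size, kernel_size))
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros((c, h, w), dtype=np.float64)
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            window = padded[:, dy:dy + h, dx:dx + w]  # (C_in, H, W)
            # Mix input channels with the kernel slice at offset (dy, dx).
            out += np.einsum("oi,ihw->ohw", weights[:, :, dy, dx], window)
    # Squash back to a valid pixel range (assumed choice).
    return 1.0 / (1.0 + np.exp(-out))


rng = np.random.default_rng(0)
obs = rng.random((3, 8, 8))   # stand-in for an image observation
aug = random_conv(obs, rng)   # same shape, randomly recoloured and filtered
```

Because the kernel is resampled on every call, two calls on the same observation produce different views, which is what makes the operation usable as an augmentation.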