Measuring Visual Generalization in Continuous Control from Pixels

Jake Grigsby, Yanjun Qi

TL;DR

This paper tackles visual generalization in pixel-based continuous control by introducing DMCR, a benchmark that injects wide visual variation into the DeepMind Control Suite via visual seeds while keeping dynamics fixed. It shows that state-of-the-art representation learning methods struggle to generalize across diverse visuals, whereas data augmentation—especially color-altering transforms—substantially improves generalization, with Network Randomization achieving near-perfect cross-visual performance in some tasks. The work also analyzes which visual factors are most challenging and provides an encoder-regularization strategy to promote augmentation invariance, offering practical guidance for building perceptually robust image-based controllers. Overall, DMCR provides a reproducible, scalable platform for evaluating and advancing visual generalization in real-world-like continuous control settings.

Abstract

Self-supervised learning and data augmentation have significantly reduced the performance gap between state-based and image-based reinforcement learning agents in continuous control tasks. However, it is still unclear whether current techniques can handle the variety of visual conditions found in real-world environments. We propose a challenging benchmark that tests agents' visual generalization by adding graphical variety to existing continuous control domains. Our empirical analysis shows that current methods struggle to generalize across a diverse set of visual changes, and we examine the specific factors of variation that make these tasks difficult. We find that data augmentation techniques outperform self-supervised learning approaches and that more significant image transformations provide better visual generalization. (The benchmark and our augmented actor-critic implementation are open-sourced at https://github.com/QData/dmc_remastered.)

Paper Structure

This paper contains 28 sections, 4 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Example "Walker, Walk" visual seeds. The random aesthetic changes generated by DMCR allow for significant visual diversity. $\phi_0$ uses the default DMC assets.
  • Figure 2: Pixel SAC variants trained on 4 different visual seeds from the DMCR version of "Walker, Walk". Final performance is relatively consistent, though the pure data augmentation methods are prone to occasional collapses early in training that are difficult to recover from. We also provide a comparison of the DMC and DMCR versions of "Ball in Cup, Catch" (in bottom right).
  • Figure 3: Standard deviation of the encoder network's output across renderings of the same state with different floor textures and colors. Indices sorted by increasing variance. Results averaged over five agents.
  • Figure 4: Spatial attention maps of trained SAC+AUG and SAC+CJ+AUG agents. Computed by taking the channel-wise average of encoder layer activations, and overlaying them on the original observation. Green and red heatmap colors indicate high levels of attention. Both agents are trained with the default checkerboard floor, but the CJ agent is less distracted by a change to a concrete texture.
  • Figure 5: Testing a range of $\beta$ values on "Walker, Walk" and "Cartpole, Balance." We find there to be little difference over short training runs.
  • ...and 8 more figures
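
The captions for Figures 3 and 4 describe two simple diagnostics: the per-index standard deviation of the encoder output across renderings of the same state (sorted by increasing variance), and a spatial attention map formed by averaging a layer's activations over channels. The following is a minimal NumPy sketch of those two computations, not the authors' implementation. The function names, the `(C, H, W)` activation layout, the `(N, D)` feature layout, and the min-max scaling to `[0, 1]` are all assumptions made for illustration; resizing the map and overlaying it on the observation are omitted.

```python
import numpy as np


def spatial_attention_map(activations: np.ndarray) -> np.ndarray:
    """Channel-wise average of one encoder layer's activations.

    `activations` is assumed to have shape (C, H, W). The result has
    shape (H, W) and is min-max scaled to [0, 1] (an assumed convention)
    so it can be rendered as a heatmap.
    """
    attn = activations.mean(axis=0)  # average over the channel axis
    span = attn.max() - attn.min()
    if span == 0:
        # Constant activations carry no spatial preference.
        return np.zeros_like(attn)
    return (attn - attn.min()) / span


def sorted_feature_std(features: np.ndarray) -> np.ndarray:
    """Per-index standard deviation of encoder outputs, sorted ascending.

    `features` is assumed to have shape (N, D): one D-dimensional encoder
    output for each of N renderings of the same underlying state.
    """
    return np.sort(features.std(axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical sizes: 32 channels on an 8x8 feature grid,
    # 10 renderings of one state, 50-dimensional encoder output.
    heatmap = spatial_attention_map(rng.normal(size=(32, 8, 8)))
    stds = sorted_feature_std(rng.normal(size=(10, 50)))
    print(heatmap.shape, stds.shape)
```

Under these assumptions, an encoder that is invariant to floor texture and color would produce a `sorted_feature_std` curve close to zero at every index, while a curve that rises steeply indicates feature dimensions that respond to the visual change rather than to the state.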