DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, Alexander Lerchner

TL;DR

The paper addresses domain adaptation in reinforcement learning by introducing DARLA, a three-stage agent that first learns to perceive the environment through disentangled representations, then learns a robust source policy, and finally transfers zero-shot to target domains. DARLA uses a $\beta$-VAE framework augmented with a perceptual similarity loss, computed through a pre-trained denoising autoencoder, to obtain factorised latent states that generalise across domain shifts. Empirical results across DeepMind Lab, MuJoCo, and sim2real tasks show significant, consistent gains in zero-shot transfer over baselines and across multiple RL algorithms, with a strong correlation between representation disentanglement and transfer quality. The work demonstrates that learning disentangled vision is a key lever for robust domain adaptation in deep RL and offers a broadly applicable, model-agnostic pipeline for improved transfer.

Abstract

Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).
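The vision objective mentioned in the TL;DR (a $\beta$-VAE whose reconstruction error is measured in the feature space of a pre-trained denoising autoencoder) can be sketched roughly as follows. This is a minimal illustration written as a loss to minimise, not the authors' implementation: the names `beta_vae_dae_loss`, `kl_to_unit_gaussian`, and the `features` callable (standing in for the frozen DAE encoder) are assumptions, and a diagonal Gaussian posterior with a unit Gaussian prior is assumed.

```python
import math


def kl_to_unit_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def beta_vae_dae_loss(x, x_hat, mu, log_var, beta, features):
    """Perceptual reconstruction term plus a beta-weighted KL term.

    `features` is a hypothetical stand-in for the frozen, pre-trained
    denoising-autoencoder encoder: the input and its reconstruction are
    compared in that feature space rather than in pixel space.
    """
    reconstruction = squared_distance(features(x), features(x_hat))
    return reconstruction + beta * kl_to_unit_gaussian(mu, log_var)


# Toy usage with an identity feature map (illustrative values only).
example_loss = beta_vae_dae_loss(
    x=[1.0, 2.0],
    x_hat=[1.0, 4.0],
    mu=[1.0],
    log_var=[0.0],
    beta=4.0,
    features=lambda v: v,
)
```

Setting `beta` above 1 puts more weight on the KL term, which is the pressure towards factorised latents; comparing inputs and reconstructions through `features` rather than pixel by pixel is what the perceptual similarity loss refers to.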

Paper Structure

This paper contains 34 sections, 3 equations, 10 figures.

Figures (10)

  • Figure 1: Schematic representation of DARLA. Yellow represents the denoising autoencoder part of the model, blue represents the $\beta$-VAE part of the model, and grey represents the policy learning part of the model.
  • Figure 2: A: DeepMind Lab transfer task setup. Different conjunctions of {room, object$_1$, object$_2$} were used during different parts of the domain adaptation curriculum. During stage one, $D_U$ (shown in yellow), we used a minimal set spanning all objects and all rooms whereby each object is seen in each room. Note there is no extrinsic reward signal or notion of 'task' in this phase. During stage two, $D_S$ (shown in green), the RL agents were taught to pick up cans and balloons and avoid hats and cakes. The objects were always presented in pairs hat/can and cake/balloon. The agent never saw the hat/can pair in the pink room. This novel room/object conjunction was presented as the target domain adaptation condition $D_T$ (shown in red) where the ability of the agent to transfer knowledge of the objects' value to a novel environment was tested. B: $\beta$-VAE reconstructions (bottom row) using frames from DeepMind Lab (top row). Due to the increased $\beta>1$ necessary to disentangle the generative factors of variation in the data, the model lost information about objects. See Fig. \ref{fig_vae_ae_traversals} for a model appropriately capturing objects. C: left -- sample frames from MuJoCo simulation environments used for vision (phase 1, $S_U$) and source policy training (phase 2, $S_S$); middle -- sim2sim domain adaptation test (phase 3, $S_T$); and right -- sim2real domain adaptation test (phase 3, $S_T$).
  • Figure 3: Plot of traversals of various latents of an entangled and a disentangled version of $\beta\text{-VAE}_{DAE}$ using frames from DeepMind Lab.
  • Figure 4: Plot of traversals of $\beta$-VAE on MuJoCo. Using a disentangled $\beta$-VAE model, single latents directly control for the factors responsible for the object or arm placements.
  • Figure 5: Table: Zero-shot performance (avg. reward per episode) of the source policy $\pi_S$ in target domains within DeepMind Lab and Jaco/MuJoCo environments. Baseline agent refers to vanilla DQN/A3C/EC (DeepMind Lab) or A3C (Jaco) agents. See main text for more detailed model descriptions. Figure: Correlation between zero-shot transfer performance on the DeepMind Lab task obtained by EC-based DARLA and the level of disentanglement as measured by the transfer/disentanglement score ($r=0.6$, $p<0.001$).
  • ...and 5 more figures