Decoupling Representation Learning from Reinforcement Learning

Adam Stooke, Kimin Lee, Pieter Abbeel, Michael Laskin

TL;DR

ATC introduces reward-free, contrastive representation learning that decouples encoder training from policy optimization in vision-based RL. The approach uses augmented temporal pairs and a momentum-contrastive setup to learn encodings that generalize across tasks and domains, often matching or surpassing end-to-end RL and other UL baselines. Across DMControl, DMLab, and Atari, ATC demonstrates strong online performance, competitive offline pretraining, and partial transfer in multi-task settings, with ablations clarifying the roles of augmentation and temporal structure. This decoupled paradigm offers practical benefits for scalable, reusable representations in RL, including improved efficiency and flexibility for batch/offline contexts.

Abstract

In an effort to overcome limitations of reward-driven feature learning in deep reinforcement learning (RL) from images, we propose decoupling representation learning from policy learning. To this end, we introduce a new unsupervised learning (UL) task, called Augmented Temporal Contrast (ATC), which trains a convolutional encoder to associate pairs of observations separated by a short time difference, under image augmentations and using a contrastive loss. In online RL experiments, we show that training the encoder exclusively using ATC matches or outperforms end-to-end RL in most environments. Additionally, we benchmark several leading UL algorithms by pre-training encoders on expert demonstrations and using them, with weights frozen, in RL agents; we find that agents using ATC-trained encoders outperform all others. We also train multi-task encoders on data from multiple environments and show generalization to different downstream RL tasks. Finally, we ablate components of ATC, and introduce a new data augmentation to enable replay of (compressed) latent images from pre-trained encoders when RL requires augmentation. Our experiments span visually diverse RL benchmarks in DeepMind Control, DeepMind Lab, and Atari, and our complete code is available at https://github.com/astooke/rlpyt/tree/master/rlpyt/ul.
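
The temporal pairing and contrastive objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation (that lives in the rlpyt repository linked above). The function names `sample_temporal_pairs` and `info_nce_loss`, the plain dot-product similarity, and the array shapes are hypothetical choices made for illustration, and the image augmentation and convolutional encoder are omitted.

```python
import numpy as np


def sample_temporal_pairs(num_steps, k, batch_size, rng):
    """Sample (anchor, positive) time indices with positive = anchor + k.

    Hypothetical helper: assumes one contiguous trajectory of num_steps frames.
    """
    anchors = rng.integers(0, num_steps - k, size=batch_size)
    return anchors, anchors + k


def info_nce_loss(anchor_codes, positive_codes):
    """Contrastive (InfoNCE-style) loss over a batch of paired codes.

    Row i of positive_codes is the positive for row i of anchor_codes;
    every other row in the batch serves as a negative. Dot-product
    similarity is an assumption of this sketch.
    """
    logits = anchor_codes @ positive_codes.T              # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal


rng = np.random.default_rng(0)
anchors, positives = sample_temporal_pairs(num_steps=100, k=3, batch_size=8, rng=rng)

# Toy codes: well-separated one-hot directions stand in for encoder outputs.
codes = 10.0 * np.eye(4)
loss_matched = info_nce_loss(codes, codes)            # every anchor finds its positive
loss_mismatched = info_nce_loss(codes, codes[::-1])   # positives deliberately shuffled
```

In this toy setup the matched loss is close to zero and the mismatched loss is large, which is the behaviour the contrastive objective relies on: codes of temporally adjacent observations are pulled together relative to the other observations in the batch.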

Paper Structure

This paper contains 31 sections, 2 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: Augmented Temporal Contrast: augmented observations are processed through a learned encoder $f_\theta$, compressor $g_\phi$, and residual predictor $h_\psi$, and are associated through a contrastive loss with a positive example from $k$ time steps later, processed through a momentum encoder.
  • Figure 2: Online encoder training by ATC, fully detached from RL training, performs as well as end-to-end RL in DMControl, and better in sparse-reward environments (environment steps shown, see appendix for action repeats). Each curve is 10 random seeds.
  • Figure 3: Online encoder training by ATC, fully detached from the RL agent, performs as well as or better than end-to-end RL in DMLab (1 agent step = 4 environment steps, the standard action repeat). Prioritized ATC replay (Explore) or increased ATC training (Lasertag) addresses sparsities to nearly match performance of RL with ATC as an auxiliary loss (RL+ATC). Each curve is 5 random seeds.
  • Figure 4: Online encoder training using ATC, fully detached from the RL agent, works well in 5 of 8 games tested (1 agent step = 4 environment steps, the standard action repeat). 6 of 8 games benefit significantly from using ATC as an auxiliary loss or for weight initialization. Each curve is 8 random seeds.
  • Figure 5: RL in DMControl, using encoders pre-trained on expert demonstrations using UL, with weights frozen. Across all domains, ATC outperforms prior methods and the end-to-end RL reference. Each curve is a minimum of 4 random seeds.
  • ...and 10 more figures
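
The data flow in the Figure 1 caption (encoder, compressor, residual predictor, and a momentum encoder for the positive) can be sketched roughly as below. This is a hedged illustration rather than the paper's architecture: linear maps with `tanh` stand in for the convolutional encoder, all sizes and the value of `tau` are illustrative, the helper names are hypothetical, and applying the momentum copy to both the encoder and the compressor is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, CODE_DIM = 32, 16, 8   # illustrative sizes, not from the paper


def init_params():
    """Linear stand-ins for the encoder f_theta ("f") and compressor g_phi ("g")."""
    return {
        "f": rng.normal(scale=0.1, size=(OBS_DIM, FEAT_DIM)),
        "g": rng.normal(scale=0.1, size=(FEAT_DIM, CODE_DIM)),
    }


def encode(params, obs):
    """Apply the encoder, then the compressor: observation -> feature -> code."""
    features = np.tanh(obs @ params["f"])
    return features @ params["g"]


def predict(h_psi, code):
    """Residual predictor: the code plus a learned correction h_psi(code)."""
    return code + np.tanh(code @ h_psi)


def momentum_update(target, online, tau):
    """Move each target weight a fraction tau toward its online counterpart."""
    return {name: (1.0 - tau) * target[name] + tau * online[name] for name in target}


online = init_params()
target = {name: w.copy() for name, w in online.items()}   # momentum copy
h_psi = rng.normal(scale=0.1, size=(CODE_DIM, CODE_DIM))

obs_t = rng.normal(size=(4, OBS_DIM))          # anchors (assumed already augmented)
obs_t_plus_k = rng.normal(size=(4, OBS_DIM))   # positives from k steps later

anchor_pred = predict(h_psi, encode(online, obs_t))   # online branch
positive_code = encode(target, obs_t_plus_k)          # momentum branch

# Pretend an optimizer step shifted the online weights, then let the target follow.
online = {name: w + 1.0 for name, w in online.items()}
new_target = momentum_update(target, online, tau=0.01)
```

The contrastive loss would then compare `anchor_pred` against `positive_code`. Because the momentum branch changes only through the slow averaging step, the targets it produces drift more gradually than the online weights.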