Reinforcement Learning with Prototypical Representations

Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto

TL;DR

Proto-RL tackles the core RL challenge of learning effective representations from images by coupling prototypical latent representations with an intrinsic, entropy-based exploration signal. It pretrains an encoder and a library of prototypes in a task-agnostic phase, using a SwAV-inspired clustering objective and a nearest-neighbor entropy estimator in latent space to drive exploration. The learned representations generalize to unseen downstream DM Control Suite tasks, enabling faster, more robust policy learning, particularly in sparse-reward settings, with improved state-space coverage. This approach demonstrates that task-agnostic prototypical representations can significantly enhance downstream exploration and sample efficiency, offering a practical route toward more generalizable, fine-tunable RL systems.

Abstract

Learning effective representations in image-based environments is crucial for sample-efficient Reinforcement Learning (RL). Unfortunately, in RL, representation learning is confounded with the exploratory experience of the agent: learning a useful representation requires diverse data, while effective exploration is only possible with coherent representations. Furthermore, we would like to learn representations that not only generalize across tasks but also accelerate downstream exploration for efficient task-specific training. To address these challenges, we propose Proto-RL, a self-supervised framework that ties representation learning with exploration through prototypical representations. These prototypes simultaneously serve as a summarization of the exploratory experience of an agent as well as a basis for representing observations. We pre-train these task-agnostic representations and prototypes on environments without downstream task information. This enables state-of-the-art downstream policy learning on a set of difficult continuous control tasks.

Paper Structure

This paper contains 53 sections, 9 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: An example of Proto-RL running on the pixel-based U-maze pointmass environment with (a) task-agnostic pre-training, followed by (b) downstream RL. (a): task-agnostic exploration and representation learning stage. The state-visitation distribution is shown in blue, which converges to uniform coverage with sufficient steps. Red points depict the prototypes via closest states in the embedding space. (b): subsequent application to the Reach Center task with sparse reward. Note the rapid exploration of the environment facilitated by the pre-trained prototypes (and embedding function). Proto-RL discovers the goal location in only 200k steps, while other methods struggle to solve the task. The experiment details are provided in \ref{section:point_mass}.
  • Figure 2: Proto-RL proposes a self-supervised scheme that learns to encode high-dimensional image observations ${\bm{x}}_t$, ${\bm{x}}_{t+1}$, using an encoder $f_\theta$ along with a set of prototypes $\{{{\bm{c}}_i}\}_{i=1}^M$ that defines the basis of the latent space. Learning is done by optimizing the clustering assignment loss $\mathcal{L}_{\mathrm{SSL}}$. To encourage exploration, prototypes are simultaneously used to compute an entropy-based intrinsic reward $\hat{r}_t$ that is maximized by the exploration agent. To decouple representation learning from the exploration task, we block the gradients of the agent's RL loss $\mathcal{L}_{\mathrm{RL}}$ from updating the encoder and prototypes. See \ref{section:method} for a full description.
  • Figure 3: The entropy-based intrinsic reward used by Proto-RL. This employs a nearest-neighbor estimator (\ref{eqn:entropy}) computed over a set of embeddings ${\bm{Q}}$ that are uniformly drawn from clustering of a batch of encoded observations $\{{\bm{z}}_i\}_{i=1}^B$ with the current prototypes $\{{{\bm{c}}_i}\}_{i=1}^M$. See \ref{section:proto_exploration} for more details.
  • Figure 4: Single task evaluation using eight challenging environments from DeepMind Control Suite. For each method (except for DrQ and Plan2Explore), we first perform task-agnostic pretraining for 500k environment steps, before introducing task reward and training for a further 500k steps. DrQ uses task reward from the outset. Plan2Explore, being model-based, uses an intermediate methodology, described in \ref{subsection:exp_setup}. Proto-RL consistently beats the baselines and in many cases exceeds the fully supervised approach of DrQ.
  • Figure 5: Multi-task evaluation using two domains from DeepMind Control Suite, with four tasks in each. We perform task-agnostic pre-training for 500k steps in each domain. The frozen representation and prototypes are then applied separately to each of the four tasks, training for an additional 500k steps with the task reward. DrQ performance is measured after training for 500k steps. The results show that the representations learned by Proto-RL generalize well and enable efficient learning of multiple downstream tasks.
  • ...and 8 more figures
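
The nearest-neighbor entropy reward mentioned in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names (`knn_distance`, `intrinsic_reward`, `entropy_estimate`), the choice of `k`, the `log(1 + d)` squashing, and the use of plain tuples for embeddings are all assumptions made for illustration, since the exact estimator and its constants are not given in this excerpt. The idea it shows is generic: an embedding that lies far from its k-th nearest neighbor in a candidate set (playing the role of ${\bm{Q}}$) sits in a sparsely visited region and therefore earns a larger exploration bonus.

```python
import math


def knn_distance(z, candidates, k=1):
    """Euclidean distance from z to its k-th nearest neighbor in candidates."""
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and len(candidates)")
    dists = sorted(math.dist(z, q) for q in candidates)
    return dists[k - 1]


def intrinsic_reward(z, candidates, k=1):
    """Exploration bonus that grows with the k-NN distance.

    The log(1 + d) form is an illustrative assumption; it keeps the
    reward finite and non-negative when z coincides with a candidate.
    """
    return math.log(1.0 + knn_distance(z, candidates, k))


def entropy_estimate(embeddings, k=1, eps=1e-8):
    """Particle-based entropy estimate, up to additive and
    multiplicative constants: the mean log distance from each
    embedding to its k-th nearest neighbor among the others."""
    total = 0.0
    for i, z in enumerate(embeddings):
        others = embeddings[:i] + embeddings[i + 1:]
        total += math.log(knn_distance(z, others, k) + eps)
    return total / len(embeddings)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings standing in for the candidate set Q.
    Q = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(intrinsic_reward((3.0, 4.0), Q))  # far from Q: larger bonus
    print(intrinsic_reward((0.1, 0.0), Q))  # close to Q: smaller bonus
```

In this sketch, maximizing `intrinsic_reward` over visited states pushes the agent toward regions where embeddings are sparse, which in turn raises `entropy_estimate` over the visited set. An actual agent would compute these distances in batch on learned encoder outputs rather than on hand-written tuples.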