R3L: Relative Representations for Reinforcement Learning

Antonio Pio Ricciardi, Valentino Maiorca, Luca Moschella, Riccardo Marin, Emanuele Rodolà

TL;DR

R3L tackles generalization in visual reinforcement learning under domain shifts by adapting relative representations, which map encodings from different visual-task settings into a shared latent space. By expressing encoder outputs relative to a set of anchors and stabilizing those anchors with an exponential moving average, R3L enables zero-shot stitching of independently trained encoders and controllers, reusing components to handle unseen visual-task pairs. Empirical results on CarRacing and Atari show that end-to-end performance remains comparable to standard baselines and that zero-shot stitching significantly reduces training time while increasing flexibility. This work advances modular RL by providing a principled approach to latent-space alignment and component reuse, with practical impact in reducing computational costs and enabling rapid deployment across varied environments.

Abstract

Visual Reinforcement Learning is a popular and powerful framework that takes full advantage of the Deep Learning breakthrough. It is known that variations in input domains (e.g., different panorama colors due to seasonal changes) or task domains (e.g., altering the target speed of a car) can disrupt agent performance, necessitating new training for each variation. Recent advancements in the field of representation learning have demonstrated the possibility of combining components from different neural networks to create new models in a zero-shot fashion. In this paper, we build upon relative representations, a framework that maps encoder embeddings to a universal space. We adapt this framework to the Visual Reinforcement Learning setting, making it possible to combine components of different agents into new agents capable of effectively handling novel visual-task pairs not encountered during training. Our findings highlight the potential for model reuse, significantly reducing the need for retraining and, consequently, the time and computational resources required.
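
As a rough illustration of the relative-representation idea the abstract builds on, the sketch below re-expresses each embedding as its cosine similarities to a set of anchor embeddings. It is a minimal NumPy sketch and not the authors' implementation: the function name `relative_projection`, the random data, and the simulation of a second encoder as a rotated and rescaled copy of the first are all assumptions made for illustration.

```python
import numpy as np

def relative_projection(x, anchors, eps=1e-8):
    """Cosine similarity of every row of x to every anchor row."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    a = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + eps)
    return x @ a.T

rng = np.random.default_rng(0)

# Hypothetical absolute embeddings from "encoder A": 8 frames, 16 dimensions.
emb_a = rng.normal(size=(8, 16))

# Assumption: "encoder B" sees the same frames but its latent space is an
# orthogonally transformed, rescaled copy of encoder A's space.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
emb_b = 3.0 * emb_a @ q

# Anchors are the same frames (rows 0..3), encoded by each encoder separately.
rel_a = relative_projection(emb_a, emb_a[:4])
rel_b = relative_projection(emb_b, emb_b[:4])

# The absolute spaces differ, but the relative coordinates coincide.
max_gap = np.abs(rel_a - rel_b).max()
print(rel_a.shape, max_gap)
```

Because cosine similarity is unchanged by rotations, reflections, and uniform rescaling, the two relative spaces match up to floating-point error in this idealized setting. Real encoders trained independently are only approximately related in this way, so the agreement would be approximate as well.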

Paper Structure (43 sections, 6 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Left: Standard (absolute) encoder outputs for the two visual variations (red/green). Notice how the embeddings occupy two distinct regions of the space. Right: R3L encoder outputs. Here, points from the two sets align nearly one‐to‐one, indicating closer correspondence in their learned representations.
  • Figure 2: Data collection pipeline. We can either use the same sequence of actions on two different environment variations, or apply a transformation on top of another set.
  • Figure 3: (a) Comparison between absolute (left) and relative (right) representations produced by the same model. Rows and columns show the cosine similarity between the latent spaces coming from frames of the CarRacing environment with different visual variations (i.e., green and red grass color). Relative representations let similarities emerge not only along the diagonal, where frames are aligned, but also off-diagonal, highlighting similarities between different parts of the track. (b) We report qualitative examples by visualizing frame pairs associated with high-similarity regions in (a) (denoted by the frame number). Each pair is semantically similar, even though not in direct correspondence.
  • Figure 4: Comparison of Abs and R3L Training Curves Across Environments. Top Row: Training curves for three variations of the CarRacing environment, demonstrating that both Abs and R3L methods exhibit similar convergence tendencies, indicating that relative encoding does not cause training instability. Bottom Row: Training curves for three Atari games (Breakout, Boxing, and Pong), further supporting that both methods maintain stable and comparable performance across different gaming environments.
  • Figure 5: Comparison of evaluation scores over training frames using different values of the exponential moving average coefficient ($\alpha$). Solid lines represent mean evaluation scores, shaded regions indicate standard deviations, and the dashed red line denotes the absolute evaluation score.
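
The exponential moving average coefficient $\alpha$ varied in Figure 5 can be pictured with a small sketch of anchor smoothing. This is a hedged illustration only: the update rule below (weight `alpha` on the previous smoothed anchors, `1 - alpha` on the freshly encoded ones) is one common EMA convention and is assumed here, and `ema_update`, the noise model, and all numeric values are hypothetical rather than taken from the paper.

```python
import numpy as np

def ema_update(anchors_ema, anchors_new, alpha):
    """Blend the previous smoothed anchors with freshly encoded anchors.

    Assumed convention: larger alpha means slower, more stable anchors.
    """
    return alpha * anchors_ema + (1.0 - alpha) * anchors_new

rng = np.random.default_rng(0)

# Hypothetical "true" anchor embeddings: 4 anchors, 16 dimensions.
target = rng.normal(size=(4, 16))

anchors_ema = np.zeros_like(target)
alpha = 0.9
for step in range(200):
    # Stand-in for re-encoding the anchors with an encoder that is still training.
    noisy = target + 0.1 * rng.normal(size=target.shape)
    anchors_ema = ema_update(anchors_ema, noisy, alpha)

# Largest deviation of the smoothed anchors from the underlying target.
drift = np.abs(anchors_ema - target).max()
print(drift)
```

Under this convention, `alpha = 0` simply replaces the anchors at every step, while values close to 1 change them slowly, trading responsiveness for stability in the space the controller observes.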