Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching

Antonio Pio Ricciardi, Valentino Maiorca, Luca Moschella, Riccardo Marin, Emanuele Rodolà

TL;DR

This work tackles the challenge of generalization in reinforcement learning under visual and task shifts by enabling zero-shot reuse of trained components. It introduces Semantic Alignment for Policy Stitching (SAPS), which learns a lightweight affine mapping $\tau_u^v$ between latent spaces $\mathcal{X}_u^i$ and $\mathcal{X}_v^j$ using a small set of semantically aligned anchors, allowing encoders and controllers from different models to be stitched without retraining. Empirical results on CarRacing and LunarLander show SAPS achieving near end-to-end performance across diverse domain shifts, outperforming naive stitching and often surpassing zero-shot baselines like R3L, with latent-space analyses confirming effective alignment. The approach supports modular, robust RL in dynamic environments and points to future work on relaxing anchor requirements and extending to robotics and more drastic domain gaps.

Abstract

Deep Reinforcement Learning (RL) models often fail to generalize when even small changes occur in the environment's observations or task requirements. Addressing these shifts typically requires costly retraining, limiting the reusability of learned policies. In this paper, we build on recent work in semantic alignment to propose a zero-shot method for mapping between the latent spaces of agents trained on different visual and task variations. Specifically, we learn a transformation that maps embeddings from one agent's encoder to another agent's encoder without further fine-tuning. Our approach relies on a small set of "anchor" observations that are semantically aligned, which we use to estimate an affine or orthogonal transform. Once the transformation is found, an existing controller trained for one domain can interpret embeddings from a different (existing) encoder in a zero-shot fashion, without any additional training. We empirically demonstrate that our framework preserves high performance under visual and task domain shifts, evaluating zero-shot stitching on the CarRacing environment with changing background and task. By allowing modular re-assembly of existing policies, our approach paves the way for more robust, compositional RL in dynamically changing environments.
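
To make the anchor-based estimation concrete, the following is a minimal NumPy sketch of how an affine or orthogonal map between two latent spaces could be fitted from paired anchor embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_affine` and `fit_orthogonal`, the use of ordinary least squares and orthogonal Procrustes as solvers, and the synthetic anchors are all hypothetical choices made here.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map (W, b) such that dst ≈ src @ W + b.

    src, dst: arrays of shape (n_anchors, dim), row i of each being the
    embedding of the same anchor observation under the two encoders.
    """
    ones = np.ones((src.shape[0], 1))
    src_h = np.hstack([src, ones])              # append a bias column
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return sol[:-1], sol[-1]                    # weights, bias


def fit_orthogonal(src, dst):
    """Orthogonal Procrustes on centered anchors: dst ≈ src @ R + t, R orthogonal."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    u, _, vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    r = u @ vt
    return r, mu_d - mu_s @ r


# Synthetic stand-in for anchor embeddings (assumption: the real anchors
# would come from two trained encoders applied to aligned observations).
rng = np.random.default_rng(0)
anchors_u = rng.normal(size=(200, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # hidden ground-truth rotation
shift = rng.normal(size=8)                      # hidden ground-truth offset
anchors_v = anchors_u @ q + shift

W, b = fit_affine(anchors_u, anchors_v)
R, t = fit_orthogonal(anchors_u, anchors_v)

# Translate new embeddings from space u into space v; a controller trained
# on space v could then consume them without retraining.
z_u = rng.normal(size=(5, 8))
z_v_affine = z_u @ W + b
z_v_orth = z_u @ R + t
```

The orthogonal variant has fewer free parameters and preserves distances and angles, which may make it more robust when anchors are scarce; the affine variant can additionally absorb scaling and shearing between the two spaces.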

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Using translation methods, a controller trained on an environment with a given visual variation (left) can be reused without any training or fine-tuning on a different environment (right) with comparable performance. In red we see the trajectory of a car driven by the same controller when connected to two different encoders, one for each visual variation.
  • Figure 2: PCA visualization of encoder outputs. On the left, we illustrate how an affine alignment can effectively map one latent space to another: same frames with different backgrounds (green/red) cluster together, as indicated by the embedded screenshots. On the right, the source, unaligned embeddings remain separated, highlighting the benefit of our alignment approach in unifying observations from different environment variations.
  • Figure 3: Histogram of pairwise cosine similarities between matched states from two different environment variations, for CarRacing (top) and LunarLander (bottom). Both SAPS and R3L show very high mean similarity across paired frames, indicating that corresponding observations in each variation map to nearly identical vectors. Mean similarity for encoders without any alignment or relative encoding is very low, emphasizing the utility of latent communication methods.
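
The similarity statistic described in the Figure 3 caption can be sketched as a row-wise cosine similarity between paired embeddings. This is a hedged illustration only: the helper name `paired_cosine` and the toy vectors are assumptions, and the paper's actual evaluation code may differ.

```python
import numpy as np


def paired_cosine(a, b, eps=1e-12):
    """Cosine similarity between row i of `a` and row i of `b`.

    a, b: arrays of shape (n_pairs, dim) holding embeddings of matched
    states from two environment variations.
    """
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, eps)           # guard against zero-norm rows


# Toy paired embeddings: an identical direction, an opposite direction,
# and an orthogonal pair.
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
emb_b = np.array([[3.0, 0.0], [-1.0, -1.0], [5.0, 0.0]])

sims = paired_cosine(emb_a, emb_b)
mean_sim = sims.mean()                          # summary value a histogram would center on
```

Values near 1 for most pairs would correspond to the well-aligned case in the caption, while values scattered around 0 would correspond to unaligned encoders.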