Mapping representations in Reinforcement Learning via Semantic Alignment for Zero-Shot Stitching
Antonio Pio Ricciardi, Valentino Maiorca, Luca Moschella, Riccardo Marin, Emanuele Rodolà
TL;DR
This work tackles the challenge of generalization in reinforcement learning under visual and task shifts by enabling zero-shot reuse of trained components. It introduces Semantic Alignment for Policy Stitching (SAPS), which learns a lightweight affine mapping $\tau_u^v$ between latent spaces $\mathcal{X}_u^i$ and $\mathcal{X}_v^j$ using a small set of semantically aligned anchors, allowing encoders and controllers from different models to be stitched without retraining. Empirical results on CarRacing and LunarLander show SAPS achieving near end-to-end performance across diverse domain shifts, outperforming naive stitching and often surpassing zero-shot baselines like R3L, with latent-space analyses confirming effective alignment. The approach supports modular, robust RL in dynamic environments and points to future work on relaxing anchor requirements and extending to robotics and more drastic domain gaps.
Abstract
Deep Reinforcement Learning (RL) models often fail to generalize when even small changes occur in the environment's observations or task requirements. Addressing these shifts typically requires costly retraining, limiting the reusability of learned policies. In this paper, we build on recent work in semantic alignment to propose a zero-shot method for mapping between the latent spaces of agents trained on different visual and task variations. Specifically, we learn a transformation that maps embeddings from one agent's encoder into the latent space of another agent's encoder without further fine-tuning. Our approach relies on a small set of semantically aligned "anchor" observations, which we use to estimate an affine or orthogonal transform. Once the transformation is found, an existing controller trained for one domain can interpret embeddings from a different (existing) encoder in a zero-shot fashion, with no additional training. We empirically demonstrate that our framework preserves high performance under visual and task domain shifts, evaluating zero-shot stitching on the CarRacing environment with changing backgrounds and tasks. By allowing modular re-assembly of existing policies, our method paves the way for more robust, compositional RL in dynamically changing environments.
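As a minimal illustration of the idea above, the sketch below estimates both an affine and an orthogonal map from paired anchor embeddings, then feeds translated embeddings to a controller. It is not the authors' implementation. The abstract does not specify the estimators, so ordinary least squares (affine case) and orthogonal Procrustes via SVD (orthogonal case) are assumptions. The anchors are synthetic, and all names (`fit_affine`, `fit_orthogonal`, `controller_v`, `anchors_u`, `anchors_v`) are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map so that dst ≈ src @ W + b (assumed estimator)."""
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])  # append a bias column
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return sol[:-1], sol[-1]  # W has shape (d, d), b has shape (d,)

def fit_orthogonal(src, dst):
    """Orthogonal Procrustes on centered anchors: dst - mu_d ≈ (src - mu_s) @ R."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    u, _, vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    return u @ vt, mu_s, mu_d

# Synthetic stand-ins for anchor embeddings from two encoders (hypothetical data):
# space v is a rotated and shifted copy of space u.
rng = np.random.default_rng(0)
dim, n_anchors = 8, 200
q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
shift = rng.normal(size=dim)
anchors_u = rng.normal(size=(n_anchors, dim))
anchors_v = anchors_u @ q_true + shift

W, b = fit_affine(anchors_u, anchors_v)
R, mu_u, mu_v = fit_orthogonal(anchors_u, anchors_v)

# Hypothetical controller that expects embeddings from space v.
controller_weights = rng.normal(size=(dim, 3))

def controller_v(z):
    return np.tanh(z @ controller_weights)

# Zero-shot stitching: translate unseen u-embeddings, then reuse controller_v as is.
z_u = rng.normal(size=(5, dim))
z_v = z_u @ q_true + shift  # what encoder v would have produced in this toy setup
actions_affine = controller_v(z_u @ W + b)
actions_orth = controller_v((z_u - mu_u) @ R + mu_v)
actions_ref = controller_v(z_v)

print(np.abs(actions_affine - actions_ref).max())
print(np.abs(actions_orth - actions_ref).max())
```

In this toy setup the two spaces differ by an exact rotation and shift, so both estimators recover the map almost perfectly. With real encoders the relation is only approximate, and the choice between an affine and an orthogonal map trades flexibility for regularity.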
