Latent Space Translation via Semantic Alignment

Valentino Maiorca, Luca Moschella, Antonio Norelli, Marco Fumero, Francesco Locatello, Emanuele Rodolà

TL;DR

Latent Space Translation via Semantic Alignment studies translating latent representations across pretrained networks using a translator $\mathcal{T}$. It formulates an affine translator with constrained variants and learns it from a small set of parallel anchors after standard-scaling. Across cross-architecture, cross-modality, and autoencoding settings, the approach enables zero-shot stitching of arbitrary encoders and decoders without retraining, often matching baselines and outperforming naive absolute stitching. The findings offer practical guidance on anchor count and scaling and demonstrate broad applicability for multimodal model reuse.

Abstract

While different neural models often exhibit latent spaces that are alike when exposed to semantically related data, this intrinsic similarity is not always immediately discernible. Towards a better understanding of this phenomenon, our work shows how representations learned from these neural modules can be translated between different pre-trained networks via simpler transformations than previously thought. An advantage of this approach is the ability to estimate these transformations using standard, well-understood algebraic procedures that have closed-form solutions. Our method directly estimates a transformation between two given latent spaces, thereby enabling effective stitching of encoders and decoders without additional training. We extensively validate the adaptability of this translation procedure in different experimental settings: across various trainings, domains, architectures (e.g., ResNet, CNN, ViT), and in multiple downstream tasks (classification, reconstruction). Notably, we show how it is possible to zero-shot stitch text encoders and vision decoders, or vice-versa, yielding surprisingly good classification performance in this multimodal setting.
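
To make the idea of a closed-form translator concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes both latent spaces have the same dimensionality and that rows of the two anchor matrices are in correspondence. The names `standard_scale` and `fit_translator` are hypothetical, the data is synthetic, and reading "ortho" as an orthogonal Procrustes solution and "linear" as an ordinary least-squares fit is an assumption about how such variants could be estimated.

```python
import numpy as np


def standard_scale(anchors, eps=1e-8):
    """Standard-scale each feature; return scaled data plus the statistics used."""
    mean = anchors.mean(axis=0)
    std = anchors.std(axis=0) + eps  # eps guards against constant features
    return (anchors - mean) / std, mean, std


def fit_translator(x_anchors, y_anchors, kind="ortho"):
    """Estimate a map from space X to space Y using parallel anchors (hypothetical helper)."""
    xs, x_mean, x_std = standard_scale(x_anchors)
    ys, y_mean, y_std = standard_scale(y_anchors)

    if kind == "ortho":
        # Orthogonal Procrustes: argmin_R ||xs @ R - ys||_F subject to R.T @ R = I,
        # solved in closed form from the SVD of xs.T @ ys.
        u, _, vt = np.linalg.svd(xs.T @ ys)
        r = u @ vt
    elif kind == "linear":
        # Unconstrained least squares: argmin_R ||xs @ R - ys||_F.
        r, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    else:
        raise ValueError(f"unknown translator kind: {kind}")

    def translate(x):
        # Scale with source statistics, map, then undo the target scaling.
        return ((x - x_mean) / x_std) @ r * y_std + y_mean

    return translate, r


# Synthetic check (assumed setup): Y is a rotated, rescaled, shifted copy of X.
rng = np.random.default_rng(0)
d, n_anchors = 8, 500
q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
x = rng.normal(size=(n_anchors + 100, d))
y = 3.0 * x @ q + 1.5

translate_ortho, r_ortho = fit_translator(x[:n_anchors], y[:n_anchors], "ortho")
translate_lin, r_lin = fit_translator(x[:n_anchors], y[:n_anchors], "linear")

# Evaluate on held-out points that were not used as anchors.
x_test, y_test = x[n_anchors:], y[n_anchors:]
err_ortho = np.linalg.norm(translate_ortho(x_test) - y_test) / np.linalg.norm(y_test)
err_lin = np.linalg.norm(translate_lin(x_test) - y_test) / np.linalg.norm(y_test)
print(f"relative error  ortho: {err_ortho:.4f}  linear: {err_lin:.2e}")
```

In a stitching setup, the output of `translate` would be passed to a decoder trained on the target space, with no further training. The orthogonal variant has fewer degrees of freedom than the linear one, which is one reason the number of anchors can matter for each variant.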

Paper Structure (36 sections, 2 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Zero-shot stitching of $\mathbf{X}$ and $\mathbf{Y}$ absolute spaces utilizing relative representations and our method (the estimation of $\mathcal{T}$). Our approach does not require a decoder specifically trained on relative representations ($dec_\mathbb{Z}$). Instead, we directly translate latent spaces, enabling the use of arbitrarily pre-trained decoders originally trained on absolute spaces.
  • Figure 2: Method illustration on a synthetic example. Given a source space $\mathbf{X}$, the steps to translate it to a target $\mathbf{Y}$ are sequentially applied as described in \ref{sec:translation}. Note that the translation is not perfect due to an arbitrary distortion of the data.
  • Figure 3: Performance comparison of affine, linear, l-ortho, and ortho at varying number of anchors on classification accuracy. Results on CIFAR100 fine-grained. The same analysis for the generation case is in \ref{sup:fig:anchors-num} in the Appendix.
  • Figure 4: Scale distribution in encodings of different pre-trained encoders on the N24News dataset.
  • Figure 5: Performance comparison between different encoders and data modalities on the N24News multimodal dataset. On the right, the accuracy of models trained end-to-end on a single data modality (Score) and their average norm (Scale). On the left, the stitching performance between pairs of encoders and decoders. This shows the importance of translating from good encoders, which can even improve unimodal decoder performance. Results obtained with $2000$ anchors and ortho, with an SVM as classification head. Additional results using MLPs as decoders are in Appendix \ref{sup:cross-modality}.
  • ...and 8 more figures