SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

TL;DR

This work introduces a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data, and proposes SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence.

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

Paper Structure (35 sections, 7 theorems, 77 equations, 12 figures, 11 tables, 1 algorithm)

Key Result

Theorem 5.1

For any transport plan $P \in \Pi_n$, [...]. The proof is provided in Appendix \ref{appendix_ot}.
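The theorem statement refers to transport plans $P \in \Pi_n$, but this excerpt does not define $\Pi_n$. As a rough illustration only, the sketch below assumes $\Pi_n$ is the set of non-negative $n \times n$ matrices whose rows and columns each sum to $1/n$, and computes one such plan with standard log-domain Sinkhorn iterations. The function name `sinkhorn_plan`, the cosine cost, and the values of `eps` and `n_iters` are hypothetical choices, not taken from the paper, and this is not the SOTAlign divergence itself.

```python
import numpy as np


def _logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def sinkhorn_plan(cost, eps=0.5, n_iters=300):
    """Entropic OT plan between two uniform distributions.

    cost: (n, m) array of pairwise costs.
    Returns an (n, m) plan whose rows sum to about 1/n and columns to 1/m.
    """
    n, m = cost.shape
    log_a = np.full(n, -np.log(n))  # uniform source marginal (assumption)
    log_b = np.full(m, -np.log(m))  # uniform target marginal (assumption)
    f = np.zeros(n)  # dual potential for rows
    g = np.zeros(m)  # dual potential for columns
    for _ in range(n_iters):
        # Alternately rescale so that row, then column, marginals match.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)


# Toy usage: random unit-norm "image" and "text" embeddings (synthetic data).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 16))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
plan = sinkhorn_plan(1.0 - img @ txt.T)  # cosine cost, an illustrative choice
```

Because the column update runs last, the column marginals hold to machine precision while the row marginals hold only up to the convergence error of the iterations.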

Figures (12)

  • Figure 1: Semi-Supervised Vision-Language Alignment. We tackle the alignment of frozen unimodal encoders where paired data (red blocks) is scarce but unpaired data is abundant. The key challenge is: how to define a training signal for unpaired data when ground-truth cross-modal correspondences are missing?
  • Figure 2: SOTAlign is a two-step method for aligning pretrained unimodal image and text encoders. First, we fit a linear alignment model using only the limited number of available image-text pairs. Then, we use this linear model as a teacher to regularize the training of alignment layers $f$ and $g$ for a joint embedding space, leveraging unimodal (unpaired) data.
  • Figure 3: GPU memory usage for a batch size of $n=10$k when computing the gradient of the OT-based divergence with naive solver unrolling (blue) and with the provided explicit gradient formula (orange). Additional results are reported in Appendix \ref{appendix_sinkhorn}.
  • Figure 4: Left: Effect of the number of paired samples (with the number of unpaired samples fixed at 1M). Right: Effect of the number of unpaired samples (with the number of pairs fixed at 10k). We report zero-shot retrieval (MeanR@1) on COCO. More metrics are reported in Appendix \ref{appendix_additionnal}.
  • Figure 5: Relationship between the total sliced Wasserstein distance from the CC3M image/text datasets to the unimodal datasets and the downstream performance of SOTAlign trained on 10k CC3M image–text pairs and up to 1M samples from the corresponding unimodal datasets.
  • ...and 7 more figures
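Figure 5 relies on the sliced Wasserstein distance between datasets. The excerpt does not say how the authors estimate it or what "total" aggregates over, so the following is only a generic Monte Carlo sketch of the sliced 2-Wasserstein distance between two equal-sized point clouds. The function name `sliced_wasserstein`, the number of projections, and the fixed seed are hypothetical.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=256, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same n (equal-size assumption).
    """
    if x.shape != y.shape:
        raise ValueError("this sketch assumes x and y have identical shapes")
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random directions, normalized onto the unit sphere.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project onto each direction; in 1D, optimal transport between two
    # equal-size uniform samples matches sorted values to sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    # Average squared 1D distances over samples and directions.
    return float(np.sqrt(np.mean((px - py) ** 2)))


# Toy usage on synthetic features: a shifted copy is farther than the original.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
d_same = sliced_wasserstein(feats, feats)
d_shift = sliced_wasserstein(feats, feats + 1.0)
```

Sorting costs O(n log n) per direction, which is what makes this kind of estimate practical at dataset scale compared with solving a full transport problem.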

Theorems & Definitions (13)

  • Theorem 5.1
  • Proposition 3.1: Closed-form solution of Procrustes Alignment
  • proof
  • Proposition 3.2: Closed-form solution of CCA
  • proof
  • Lemma 3.3
  • proof
  • Theorem 3.4
  • proof
  • Proposition 3.5
  • ...and 3 more
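
The statement of Proposition 3.1 is not included in this excerpt. For orientation, the textbook orthogonal Procrustes problem, $\min_{W^\top W = I} \|XW - Y\|_F$, has the closed-form solution $W = UV^\top$, where $U \Sigma V^\top$ is the SVD of $X^\top Y$. The sketch below implements that classical result under the assumption that both embedding matrices share the same dimension. The paper's variant may differ (for example in centering, scaling, or handling encoders of different widths), and the name `procrustes_align` is hypothetical.

```python
import numpy as np


def procrustes_align(x, y):
    """Orthogonal matrix W minimizing ||x @ W - y||_F.

    x, y: paired embeddings of shape (n, d); row i of x matches row i of y.
    """
    u, _, vt = np.linalg.svd(x.T @ y)  # SVD of the d x d cross-covariance
    return u @ vt


# Toy usage: recover a known rotation from a handful of synthetic pairs.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))
q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
tgt = src @ q_true
w = procrustes_align(src, tgt)
```

A linear map fitted this way needs only one SVD of a $d \times d$ matrix, which is why it can serve as an inexpensive teacher when few pairs are available.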