It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

Dominik Schnaus, Nikita Araslanov, Daniel Cremers

TL;DR

This work tackles unsupervised vision–language alignment without parallel data by formulating it as a quadratic assignment problem over intra-modality pairwise similarities. It introduces a memory-efficient, factorized Hahn-Grant solver that yields tight bounds and strong primal solutions (scaling roughly as $\mathcal{O}(N^5)$) and demonstrates feasibility across 33 vision and 27 language models on four datasets, including a proof-of-concept unsupervised classifier that assigns image concepts without paired annotations. The study further shows how to select optimal class subsets via a $p$-dispersion-sum formulation and compares multiple solvers, establishing the superiority of the proposed approach in finding meaningful blind matches up to moderate sizes. Overall, the paper provides both methodological and empirical evidence that vision–language correspondence can emerge in an annotation-free setting, while outlining key limitations related to scale, symmetry, and concept coverage with current models.
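In its generic form, the $p$-dispersion-sum problem picks $p$ out of $N$ items so that the sum of pairwise scores inside the chosen subset is maximal. The following is a minimal brute-force sketch of that generic problem, not the paper's procedure. The matrix `scores` and the function names are hypothetical, the values are made up for illustration, and this summary does not state how the paper derives its pairwise scores or which solver it uses.

```python
from itertools import combinations

def subset_score(scores, subset):
    """Sum of pairwise scores over all unordered pairs inside the subset."""
    return sum(scores[i][j] for i, j in combinations(subset, 2))

def p_dispersion_sum(scores, p):
    """Exhaustively pick the size-p subset with the largest pairwise-score sum."""
    n = len(scores)
    return max(combinations(range(n), p), key=lambda s: subset_score(scores, s))

# Hypothetical symmetric score matrix for 5 candidate classes (illustrative values only).
scores = [
    [0, 2, 9, 4, 7],
    [2, 0, 3, 8, 1],
    [9, 3, 0, 6, 5],
    [4, 8, 6, 0, 2],
    [7, 1, 5, 2, 0],
]

best = p_dispersion_sum(scores, 3)
best_value = subset_score(scores, best)
print(best, best_value)  # (0, 2, 4) 21
```

Exhaustive enumeration visits every one of the $\binom{N}{p}$ subsets, so it only serves to illustrate the objective on tiny inputs.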

Abstract

The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first feasibility study and investigate the conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
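To make the quadratic assignment formulation concrete, here is a minimal Python sketch of blind matching on toy data. It rests on several assumptions that are not taken from the paper: the 2-D "embeddings" are hypothetical, the cost is the squared difference between pairwise Euclidean distances in the two modalities, and an exhaustive search over permutations stands in for the paper's factorized Hahn-Grant solver.

```python
from itertools import permutations

def pairwise_distances(points):
    """Euclidean distance matrix between all pairs of embedding vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in points]
            for p in points]

def qap_cost(dist_v, dist_l, perm):
    """Squared gap between pairwise distances when vision item i
    is matched to language item perm[i]."""
    n = len(perm)
    return sum((dist_v[i][j] - dist_l[perm[i]][perm[j]]) ** 2
               for i in range(n) for j in range(n))

def blind_match(dist_v, dist_l):
    """Exhaustive search over all N! permutations; only viable for tiny N."""
    n = len(dist_v)
    return min(permutations(range(n)),
               key=lambda perm: qap_cost(dist_v, dist_l, perm))

# Hypothetical vision embeddings for 4 concepts.
vision = [(0.0, 0.0), (1.0, 0.2), (5.0, 4.0), (2.0, 7.0)]
# Hypothetical language embeddings: the same geometry, rotated by 90 degrees,
# slightly perturbed, and stored in a different (unknown to the solver) order.
language = [(-0.2, 1.1), (-7.1, 2.0), (0.05, 0.0), (-4.0, 4.9)]

dist_v = pairwise_distances(vision)
dist_l = pairwise_distances(language)

# Only the two intra-modality distance matrices are used; no paired data.
matching = blind_match(dist_v, dist_l)
print(matching)  # (2, 0, 3, 1): vision item i is matched to language item matching[i]
```

The search never compares a vision vector with a language vector directly. It only checks whether the pairwise relations inside each modality agree under a candidate permutation, which is what makes the matching "blind".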

Paper Structure

This paper contains 26 sections, 34 equations, 13 figures, 7 tables, 4 algorithms.

Figures (13)

  • Figure 1: Blind matching of vision and language: Text and images are both abstractions of the same underlying world. Vision and language encoders $f_v$ and $f_l$ learn similar pairwise relations between concepts, e.g., "cat" is closer to "dog" than to "airplane". We exploit these pairwise relations in our factorized Hahn-Grant solver to find valid correspondences between vision and language without any parallel data.
  • Figure 2: Shuffling degrades vision-language alignment: The vision-language alignment (here, measured by Mutual k-NN) monotonically decreases as we increasingly shuffle the oracle assignment. This holds for all considered metrics, justifying their use in the optimization objective. We encourage zooming in.
  • Figure 3: Hahn-Grant solver hahn1998lower
  • Figure 4: Most vision and language models can be matched non-trivially: We visualize the accuracy for multiple vision models with the all-mpnet-base-v2 Reimers:2019:SBE language model on CIFAR-10 krizhevsky2009learning (left) and with the All-Roberta-large-v1 Reimers:2019:SBE language model on CINIC-10 darlow2018cinic (right). The error bar shows the standard deviation for 20 random seeds and the dashed line shows the performance of random matching. Observe that most vision representations can be matched with high accuracy to language, and the pre-training method has a greater impact than the model size. DINOv2 Oquab:2024:DIN achieves the highest accuracy, on average.
  • Figure 5: Some fine-grained problems can be matched with high accuracy: For each problem size, we select the optimal ten subsets of classes using the optimization described in the optimal-subset section on ImageNet-100 russakovsky2015imagenet (left) and CIFAR-100 krizhevsky2009learning (right). We show the matching accuracy for all optimization problems, each with three random seeds for the vision models DINOv2 Oquab:2024:DIN, CLIP Ramesh:2022:HTC, and DeiT touvron2021training using the all-mpnet-base-v2 language model Reimers:2019:SBE. We observe that we can find optimization problems, especially for $N < 40$, that lead to accurate matching. On both datasets, CLIP performs best for most problem sizes.
  • ...and 8 more figures
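
The effect described in the Figure 2 caption can be illustrated with a small Python sketch of a mutual k-NN score. It assumes a common definition of the metric (the average overlap between the k-nearest-neighbour sets of matched items in the two modalities), which may differ in detail from the one used in the paper. The toy distances, the shuffled assignment, and the function names are hypothetical.

```python
def knn_indices(dist, i, k):
    """Indices of the k nearest neighbours of item i (excluding i itself)."""
    others = [j for j in range(len(dist)) if j != i]
    return set(sorted(others, key=lambda j: dist[i][j])[:k])

def mutual_knn(dist_v, dist_l, perm, k):
    """Average k-NN overlap when vision item i is matched to language item perm[i]."""
    n = len(dist_v)
    total = 0.0
    for i in range(n):
        # Map the vision neighbours of i into language indices via the matching.
        nn_v = {perm[j] for j in knn_indices(dist_v, i, k)}
        nn_l = knn_indices(dist_l, perm[i], k)
        total += len(nn_v & nn_l) / k
    return total / n

# Idealized toy case: both modalities share exactly the same pairwise distances.
positions = [0.0, 1.0, 2.5, 6.0, 7.0, 9.0]
dist = [[abs(a - b) for b in positions] for a in positions]

oracle = mutual_knn(dist, dist, (0, 1, 2, 3, 4, 5), k=2)    # correct assignment
shuffled = mutual_knn(dist, dist, (3, 1, 2, 0, 4, 5), k=2)  # items 0 and 3 swapped
print(oracle, shuffled)
```

In this idealized setting the correct assignment reaches the maximum score of 1.0, and swapping two items lowers it. The caption reports the same trend for the real metrics, which is what justifies using them as optimization objectives.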