Learning Shared Representations from Unpaired Data

Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham

TL;DR

The paper tackles the data bottleneck in cross-modal learning by aiming to build universal cross-modal embeddings largely from unpaired data. It introduces Spectral Universal Embedding (SUE), a three-stage pipeline that learns modality-specific spectral embeddings, aligns them with a small set of paired samples via CCA, and refines the alignment with an MMD-based residual network to produce a shared embedding space. Empirically, SUE achieves strong cross-modal retrieval, enables almost-text-free image generation and arithmetic, demonstrates zero-shot capabilities, and enables emergent cross-domain classification using far fewer paired samples than contrastive baselines. The work suggests that universal embeddings can arise from the spectral structure of unimodal representations and points to a promising direction for fully unpaired multimodal learning across diverse modalities.
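The three stages described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Gaussian affinity, the bandwidth `sigma`, the regularizer `reg`, and the function names `spectral_embedding`, `cca_fit`, `cca_transform`, and `mmd2` are all assumptions made for illustration. The residual network of the third stage is omitted; `mmd2` only computes the kind of discrepancy such a network could be trained to reduce.

```python
import numpy as np


def _gaussian_kernel(U, V, sigma):
    # Gaussian affinity between all rows of U and V (kernel choice is an assumption).
    sq = np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))


def spectral_embedding(X, dim, sigma=1.0):
    """Stage 1 sketch: embed rows of X with eigenvectors of a random-walk matrix."""
    W = _gaussian_kernel(X, X, sigma)
    d = W.sum(axis=1)
    # P = D^{-1} W shares eigenvalues with S = D^{-1/2} W D^{-1/2};
    # eigenvectors of P are D^{-1/2} times the eigenvectors of S.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vecs = vecs[:, order] / np.sqrt(d)[:, None]
    emb = vecs[:, 1:dim + 1]              # skip the constant leading eigenvector
    return emb / emb.std(axis=0)          # rescale columns to unit spread


def cca_fit(A, B, dim, reg=1e-6):
    """Stage 2 sketch: linear CCA on a small set of paired rows of A and B."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - mu_a, B - mu_b
    n = A.shape[0]
    Caa = A0.T @ A0 / (n - 1) + reg * np.eye(A.shape[1])
    Cbb = B0.T @ B0 / (n - 1) + reg * np.eye(B.shape[1])
    Cab = A0.T @ B0 / (n - 1)
    # Whiten each view with its inverse Cholesky factor, then take the SVD
    # of the whitened cross-covariance.
    La_inv = np.linalg.inv(np.linalg.cholesky(Caa))
    Lb_inv = np.linalg.inv(np.linalg.cholesky(Cbb))
    U, s, Vt = np.linalg.svd(La_inv @ Cab @ Lb_inv.T)
    Wa = La_inv.T @ U[:, :dim]
    Wb = Lb_inv.T @ Vt.T[:, :dim]
    return mu_a, Wa, mu_b, Wb, s[:dim]    # s holds the canonical correlations


def cca_transform(E, mu, W):
    # Apply the projection learned on the pairs to any (paired or unpaired) rows.
    return (E - mu) @ W


def mmd2(A, B, sigma=1.0):
    """Stage 3 ingredient: biased estimate of the squared MMD between two samples."""
    return (_gaussian_kernel(A, A, sigma).mean()
            + _gaussian_kernel(B, B, sigma).mean()
            - 2.0 * _gaussian_kernel(A, B, sigma).mean())
```

In this sketch, each modality would be embedded separately with `spectral_embedding` using all of its samples, `cca_fit` would see only the few paired rows, and `cca_transform` would then map every sample of both modalities into the shared coordinates.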

Abstract

Learning shared representations is a primary area of multimodal representation learning. The current approaches to achieve a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of the random walk matrices constructed independently from each unimodal representation. Empirical results in computer vision and natural language processing domains support the potential of this approach, revealing the effectiveness of unpaired data in capturing meaningful cross-modal relations and demonstrating strong capabilities in retrieval, generation, arithmetic, zero-shot, and cross-domain classification tasks. This work, to the best of our knowledge, is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our project page: https://shaham-lab.github.io/SUE_page.

Paper Structure

This paper contains 76 sections, 1 theorem, 10 equations, 15 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

(berard1994embedding, Thm. 21) Let $h$ be any metric on $\mathcal{M}$ such that $(1-\epsilon)g \leq h \leq (1 + \epsilon)g$, $\epsilon < \epsilon_0$. We assume furthermore that the metrics under consideration have their Ricci curvatures bounded from below by $-(n-1)K^2$ for some constant $K$. There e

Figures (15)

  • Figure 1: Empirical demonstration of universality. (a) Distances between corresponding random walks on image and text graphs from MSCOCO, compared to distances to randomly shuffled (non-matching) walks. Although constructed independently from unimodal features, corresponding walks exhibit significantly greater similarity. (b) Distances between paired and unpaired points in the shared space of aligned 2D spectral embeddings (SEs). Paired points are consistently closer, indicating that the independently learned SEs capture analogous structure across modalities (see App. \ref{app:rw_exp}).
  • Figure 2: Almost exclusively unpaired image retrieval. Retrieved images by SUE for custom captions on MSCOCO, trained with 100 pairs and 10k non-pairs. Despite minimal paired data, the results semantically align closely with text queries.
  • Figure 3: Overview of SUE. The modalities (represented by their unimodal embeddings) reflect an unobserved universal (semantic) distribution; the SE can recover this universal structure, up to rotations; CCA on a minimal number of pairs enables linear alignment between the modalities, but is not sufficient for a joint universal embedding; the MMD stage then corrects the remaining misalignment between the modalities, integrating them into the universal embedding space.
  • Figure 4: (a) Image retrieval examples. SUE captures cross-modal semantic structure. Top four retrieved images for text and shoe-edge queries from Flickr30k and edges2shoes. True pairs are excluded, yet results remain semantically aligned. (b) (Almost-) Text-free text-to-image generation examples. Images generated from various text queries using a generator and a converter trained exclusively on images. (c) (Almost-) Text-free arithmetic examples. Images generated from sums of text and image queries; for example, “with sunglasses” + man/woman image yields a corresponding result with sunglasses.
  • Figure 5: (a) Contrastive learning requires an order of magnitude more pairs to match SUE in the weakly-paired regime. Recall@10 results on MSCOCO by SUE (with 100 pairs) and Contrastive with various numbers of pairs. SUE exploits unpaired data to outperform contrastive learning when limited pairs are available. (b-c) Effect of the number of unpaired and paired samples on Recall@10 for image retrieval on the Flickr30k dataset. (b) SUE improves as the amount of unpaired data increases. (c) SUE relies on non-pairs instead of pairs: it depends minimally on paired data and substantially on unpaired data, so its performance can be improved with additional unpaired samples, which are much easier to obtain.
  • ...and 10 more figures
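
The Recall@10 metric reported in Figure 5 can be computed along the following lines. This is a minimal sketch rather than the paper's evaluation code: it assumes that query `i` and gallery item `i` form the true pair, that similarity is measured with Euclidean distance in the shared space, and that the function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(queries, gallery, k=10):
    """Fraction of queries whose true match (same row index) is among the k nearest gallery items."""
    k = min(k, gallery.shape[0])
    # Squared Euclidean distance between every query and every gallery item.
    d2 = (np.sum(queries ** 2, axis=1)[:, None]
          - 2.0 * queries @ gallery.T
          + np.sum(gallery ** 2, axis=1)[None, :])
    # Indices of the k smallest distances per query (order within the k is irrelevant).
    topk = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    hits = (topk == np.arange(queries.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())
```

For text-to-image retrieval, `queries` would hold the text samples mapped into the shared space and `gallery` the image samples, with rows ordered so that matching pairs share an index.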

Theorems & Definitions (1)

  • Theorem 1