Learning Shared Representations from Unpaired Data
Amitai Yacobi, Nir Ben-Ari, Ronen Talmon, Uri Shaham
TL;DR
The paper tackles the paired-data bottleneck in cross-modal learning by building universal cross-modal embeddings largely from unpaired data. It introduces Spectral Universal Embedding (SUE), a three-stage pipeline that (1) learns modality-specific spectral embeddings, (2) aligns them with CCA using a small set of paired samples, and (3) refines the alignment with an MMD-based residual network to produce a shared embedding space. Empirically, SUE achieves strong cross-modal retrieval, supports almost-text-free image generation and embedding arithmetic, exhibits zero-shot capabilities, and yields emergent cross-domain classification, all with far fewer paired samples than contrastive baselines require. The work suggests that universal embeddings can arise from the spectral structure of unimodal representations, and it points to fully unpaired multimodal learning across diverse modalities as a promising direction.
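To make the MMD term in the third stage concrete, the following is a minimal sketch of a biased squared-MMD estimate with a Gaussian kernel, computed between two sets of embeddings. The function name `rbf_mmd2`, the bandwidth value, and the toy data are assumptions made for illustration; they are not taken from the paper, and the residual network that SUE trains against such a loss is not shown.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y (rows are points)."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at zero for numerical safety.
        sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))
    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

# Hypothetical toy data: two samples from the same distribution, and a shifted one.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 2))
b = rng.normal(size=(200, 2))
c = rng.normal(loc=3.0, size=(200, 2))

mmd_same = rbf_mmd2(a, b)      # small: distributions match
mmd_shifted = rbf_mmd2(a, c)   # larger: distributions differ
```

Driving such a quantity toward zero encourages the two embedding distributions to match without requiring any paired samples, which is why it suits a refinement step that uses unpaired data.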
Abstract
Learning shared representations is a primary area of multimodal representation learning. Current approaches to obtaining a shared embedding space rely heavily on paired samples from each modality, which are significantly harder to obtain than unpaired ones. In this work, we demonstrate that shared representations can be learned almost exclusively from unpaired data. Our arguments are grounded in the spectral embeddings of random walk matrices constructed independently from each unimodal representation. Empirical results in the computer vision and natural language processing domains support the potential of this approach: they reveal the effectiveness of unpaired data in capturing meaningful cross-modal relations and demonstrate strong capabilities in retrieval, generation, arithmetic, zero-shot, and cross-domain classification tasks. To the best of our knowledge, this work is the first to demonstrate these capabilities almost exclusively from unpaired samples, giving rise to a cross-modal embedding that could be viewed as universal, i.e., independent of the specific modalities of the data. Our project page: https://shaham-lab.github.io/SUE_page.
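As an illustration of the spectral embedding of a random walk matrix mentioned above, here is a minimal sketch for a single modality. It assumes a dense Gaussian affinity and an exact eigendecomposition, which is practical only for small datasets; the function name, the bandwidth, and the toy data are hypothetical, and this is not presented as the authors' implementation.

```python
import numpy as np

def random_walk_spectral_embedding(x, n_components=2, bandwidth=1.0):
    """Leading non-trivial eigenpairs of the random walk matrix P = D^{-1} W."""
    # Gaussian affinity matrix W from pairwise squared distances.
    sq = (x**2).sum(1)[:, None] + (x**2).sum(1)[None, :] - 2.0 * x @ x.T
    w = np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))
    d = w.sum(axis=1)
    # P is similar to the symmetric matrix S = D^{-1/2} W D^{-1/2},
    # so we can use a symmetric eigensolver on S.
    s = w / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(s)
    order = np.argsort(vals)[::-1]          # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    # Right eigenvectors of P are D^{-1/2} u for eigenvectors u of S.
    psi = vecs / np.sqrt(d)[:, None]
    # Skip the trivial constant eigenvector (eigenvalue 1).
    return vals[1:n_components + 1], psi[:, 1:n_components + 1]

# Hypothetical toy data: two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                    rng.normal(5.0, 0.3, size=(20, 2))])
vals, emb = random_walk_spectral_embedding(points, n_components=1)
```

On this toy data, the first non-trivial eigenvector takes opposite signs on the two clusters, so the embedding reflects the cluster structure of the data rather than its raw coordinates. This is the kind of modality-independent structure the abstract refers to.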
