Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola

TL;DR

The paper investigates the Unpaired Multimodal Learner (UML), a modality-agnostic framework that uses unpaired data from auxiliary modalities to improve unimodal representations without requiring cross-modal alignments. It provides a theoretical foundation under linear assumptions, based on Fisher information, showing that combining modalities reduces uncertainty about shared latent factors, and it instantiates UML as a shared-weight architecture that enables cross-modal transfer in both self-supervised and supervised settings. Empirically, UML yields consistent gains across vision, text, and audio benchmarks, with stronger improvements as more modalities are added. The paper also demonstrates transfer from language priors to vision, quantifies an exchange rate between modalities, and reports the emergence of multimodal neurons without paired supervision. Finally, it discusses limitations and reproducibility, and suggests practical impact for domains rich in unpaired auxiliary data, such as medical imaging and robotics, where auxiliary text, audio, or metadata can enhance unimodal models.
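
The Fisher-information argument mentioned above can be sketched in a few lines. The notation below is illustrative and assumed for this sketch (it is not taken from the paper, whose exact model and symbols may differ): $\theta$ is the shared latent parameter, $A_i$ and $B_j$ are known linear maps for the two modalities, and the noise terms are independent standard Gaussians.

```latex
% Illustrative linear-Gaussian model (assumed notation, not the paper's):
% theta in R^d is shared; A_i, B_j are known maps; noise is i.i.d. N(0, I).
\begin{align*}
X_i &= A_i \theta + \varepsilon_i, \quad i = 1, \ldots, N_x, \\
Y_j &= B_j \theta + \eta_j, \quad j = 1, \ldots, N_y, \\
I_X &= \sum_{i=1}^{N_x} A_i^\top A_i, \qquad
I_Y = \sum_{j=1}^{N_y} B_j^\top B_j, \\
I_{X,Y} &= I_X + I_Y \succeq I_X
\quad\Longrightarrow\quad
(I_X + I_Y)^{-1} \preceq I_X^{-1}
\quad \text{when } I_X \succ 0 .
\end{align*}
```

Because the two datasets are independent, their Fisher information matrices add, and no pairing between $X_i$ and $Y_j$ is needed for this step. The inverse reverses the Loewner order, so the attainable covariance of an unbiased estimator of $\theta$ can only shrink, and it shrinks strictly along directions that the auxiliary maps $B_j$ actually observe.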

Abstract

Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/
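
To make the "alternately processes inputs from different modalities while sharing parameters" idea concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the data generator, the linear model, and all names (`make_modality`, `step`, `train`, `HIDDEN`, `NUM_CLASSES`) are hypothetical choices for illustration. Each modality gets its own input projection, both feed one shared classifier head, and training alternates between the two unpaired datasets.

```python
import numpy as np

NUM_CLASSES, HIDDEN = 3, 4  # hypothetical toy sizes

def make_modality(n, dim, rng):
    """Toy data (assumed generator): noisy class-dependent features plus labels."""
    labels = rng.integers(0, NUM_CLASSES, size=n)
    centers = rng.normal(size=(NUM_CLASSES, dim))
    feats = centers[labels] + 0.1 * rng.normal(size=(n, dim))
    return feats, labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def step(feats, labels, proj, shared, lr=0.05):
    """One full-batch gradient step on cross-entropy.

    Updates `proj` (modality-specific) and `shared` (common head) in place
    and returns the loss measured before the update.
    """
    hidden = feats @ proj                 # modality-specific projection
    probs = softmax(hidden @ shared)      # shared classifier head
    onehot = np.eye(NUM_CLASSES)[labels]
    grad_logits = (probs - onehot) / len(labels)
    grad_shared = hidden.T @ grad_logits
    grad_proj = feats.T @ (grad_logits @ shared.T)
    shared -= lr * grad_shared
    proj -= lr * grad_proj
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def train(num_steps=600, seed=0):
    rng = np.random.default_rng(seed)
    # Different sample counts and feature sizes: no cross-modal pairing exists.
    data = {"x": make_modality(120, 8, rng), "y": make_modality(90, 5, rng)}
    proj = {"x": 0.5 * rng.normal(size=(8, HIDDEN)),
            "y": 0.5 * rng.normal(size=(5, HIDDEN))}
    shared = 0.5 * rng.normal(size=(HIDDEN, NUM_CLASSES))
    history = {"x": [], "y": []}
    for t in range(num_steps):
        name = "x" if t % 2 == 0 else "y"  # alternate between modalities
        feats, labels = data[name]
        history[name].append(step(feats, labels, proj[name], shared))
    return history, proj, shared
```

Calling `train()` returns the per-modality loss curves together with the learned parameters. The only coupling between modalities in this sketch is the `shared` matrix, which receives gradient updates from both datasets in turn.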

Paper Structure

This paper contains 65 sections, 9 theorems, 46 equations, 47 figures, 29 tables, 2 algorithms.

Key Result

Theorem 1

Let $\hat{\theta}_X, \hat{\theta}_Y$ be the least-squares estimators for $\theta$ using only $\{X_i\}$ and only $\{Y_j\}$, respectively, and let $\hat{\theta}_{X,Y}$ be the joint estimator using both unpaired datasets. Then, under the assumption that at least one $B_{c,j}$, where $j \in \{1, 2, \ldots, N_y\}$, has full

Figures (47)

  • Figure 1: Text provides complementary information beyond images, even when not paired directly. We introduce the Unpaired Multimodal Learner (UML), in which sharing model weights across modalities (e.g., image and text) extracts synergies and enhances unimodal representations, outperforming methods that rely only on a single modality (such as images above).
  • Figure 2: (a) Paired learning uses $\{(x_i, y_i)\}$ with known correspondences. We instead study unpaired learning: (b) with labels, using $\{(x_i, c_i)\}$ and $\{(y_j, \hat{c}_j)\}$, where $c_i$ and $\hat{c}_j$ denote labels for $x_i$ and $y_j$, but no cross-modal correspondences; and (c) without any labels or correspondences, using $\{x_i\}$ and $\{y_j\}$.
  • Figure 3: Adding unpaired $Y$ samples boosts $X$ reconstruction more than adding extra $X$ samples.
  • Figure 4: (Left) Inputs from different modalities (e.g., images or text) are tokenized into patch or token embeddings using pretrained encoders or processed features. (Right) UML can be trained under two settings: (a) self-supervised, where patch/token embeddings are passed through a shared network and modality-specific decoders to perform next-token/patch prediction; (b) supervised, where mean/CLS embeddings are fed through the shared classifier to predict labels within each modality.
  • Figure 5: Our approach, UML, is much more robust than its unimodal counterpart across four target test sets with test-time distribution shift. All results are averaged across three random seeds.
  • ...and 42 more figures

Theorems & Definitions (22)

  • Theorem 1: Variance Reduction with Unpaired Multimodal Data
  • Theorem 2: Directional Variance Reduction with Unpaired Multimodal Data
  • Theorem 3: Data from Auxiliary Modality Can Outperform More of the Same
  • Definition 1: Positive Semidefinite Matrix
  • Definition 2: Positive Definite Matrix
  • Definition 3: Loewner Order
  • Definition 4: Fisher Information Matrix
  • Lemma 1: Loewner Order reversal for inverses
  • proof
  • Lemma 2: Inverse–monotonicity of the Moore--Penrose pseudoinverse
  • ...and 12 more