With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You

Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić

TL;DR

This work tackles the challenge of limited data for multimodal alignment by freezing pretrained unimodal encoders and learning lightweight alignment functions guided by a geometry-preserving regularizer. The STRUCTURE regularizer enforces multi-scale neighborhood consistency between each modality's latent space and the shared embedding, and the method aligns the layer pairs with the highest representational similarity rather than defaulting to the last layers. Empirically, the approach yields substantial gains across 24 zero-shot classification and retrieval benchmarks (average relative improvements of $51.6\%$ and $91.8\%$, respectively) and remains effective under extreme data scarcity, even approaching large multimodal models when a few in-domain labels are added. The results suggest that preserving pretrained geometry and targeted layer selection can dramatically improve resource-efficient multimodal learning in practical, domain-constrained settings.

Abstract

Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than $1\%$ of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvements of $51.6\%$ in classification and $91.8\%$ in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
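To make the idea of a neighborhood-preserving regularizer concrete, the following is a minimal NumPy sketch of one way such a penalty could be computed. It is not the paper's exact formulation, which is not reproduced on this page. The assumptions here are ours: neighborhoods are modeled as row-softmax distributions over cosine similarities, "multi-scale" is interpreted as comparing multi-step random-walk matrices, and the discrepancy is measured with a Jensen-Shannon divergence. The function names, the `temperature`, and the `levels` parameter are all hypothetical.

```python
import numpy as np

def neighborhood_matrix(feats, temperature=0.1):
    """Row-stochastic neighborhood distribution from cosine similarities."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)            # a sample is not its own neighbor
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Mean row-wise Jensen-Shannon divergence between two stochastic matrices."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=1)
    return float(np.mean(0.5 * (kl_pm + kl_qm)))

def structure_like_penalty(frozen_feats, aligned_feats, levels=2, temperature=0.1):
    """Assumed form: compare l-step neighborhoods before and after alignment."""
    p = neighborhood_matrix(frozen_feats, temperature)   # pretrained latent space
    q = neighborhood_matrix(aligned_feats, temperature)  # shared embedding space
    p_l, q_l = p, q
    total = 0.0
    for level in range(1, levels + 1):
        total += js_divergence(p_l, q_l) / level  # down-weight coarser scales
        p_l, q_l = p_l @ p, q_l @ q               # one more random-walk step
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 8))                      # stand-in for frozen encoder features
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
preserved = structure_like_penalty(x, x @ rotation)              # geometry kept
scrambled = structure_like_penalty(x, rng.normal(size=(40, 8)))  # geometry lost
print(preserved, scrambled)
```

In this sketch, an alignment map that only rotates the features leaves all cosine similarities unchanged and so incurs essentially no penalty, while a map that scrambles neighborhoods is penalized. In training, a term like this would be added to the alignment loss with some weight; the weighting scheme is likewise left open here.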

Paper Structure

This paper contains 37 sections, 3 theorems, 21 equations, 12 figures, 12 tables.

Key Result

Lemma 1

The generalization gap between the empirical and the expected values of $\mathcal{R}_{\mathrm{S}}$ is bounded by where $\hat{\mathcal{R}}_N$ is the empirical STRUCTURE regularizer and ${\mathcal{R}}^\star$ is its expectation over the data distribution. This shows that the empirical regularizer faithfully approximates its population expectation as the number of samples increases.

Figures (12)

  • Figure 1: Overview of the proposed approach for cross-modal alignment with limited data. The objective is to align representations from two modalities (e.g., images and text) into a shared embedding space. The central challenge is guiding the model toward a well-aligned solution, rather than a misaligned one, when only a small amount of paired data is available. The key idea is to freeze pretrained encoders and learn lightweight alignment functions that preserve each modality's pretrained latent structure during alignment.
  • Figure 2: Zero-shot performance of different model combinations when aligning different layers as a function of their representational similarity measured in mutual kNN (MkNN). Here, the star indicates the performance achieved when aligning the last layers of the models, and $\rho$ is the average Spearman's rank correlation coefficient across different datasets.
  • Figure 3: Comparison of zero-shot and retrieval performance for linear alignment when scaling down the training data, repeated five times for each sample size. Here, $U$ quantifies the proposed method's label efficiency by computing the utility compared to using the last layer.
  • Figure 4: Zero-shot performance when randomly adding LAION samples to the COCO training set, repeated three times, when aligning the best layers and adding the regularization.
  • Figure 5: Zero-shot classification performance as in-domain samples are added to the training set. Performance is evaluated on multiple fine-grained in-domain tasks, where the added and evaluation samples come from the same dataset. Here, the star indicates the CLIP performance.
  • ...and 7 more figures
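
The Figure 2 caption refers to representational similarity measured with mutual kNN (MkNN). As a rough illustration, the sketch below computes a common form of that score (the average overlap between each sample's k nearest neighbors in two feature spaces) and uses it to pick the most similar layer pair. The cosine-similarity neighbor search, the default `k`, and the helper names `knn_indices`, `mutual_knn`, and `select_layer_pair` are our assumptions, not details confirmed by this page.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each row under cosine similarity."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)            # exclude each sample itself
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn(x, y, k=10):
    """Average fraction of shared neighbors for paired samples x[i] <-> y[i]."""
    nn_x = knn_indices(x, k)
    nn_y = knn_indices(y, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_y)]
    return float(np.mean(overlap))

def select_layer_pair(layers_a, layers_b, k=10):
    """Return the (i, j) layer pair with the highest mutual kNN score."""
    scores = np.array([[mutual_knn(a, b, k) for b in layers_b] for a in layers_a])
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j), scores

rng = np.random.default_rng(0)
shared = rng.normal(size=(50, 16))            # structure common to both modalities
layers_a = [rng.normal(size=(50, 16)), shared]        # toy "image" layers
layers_b = [2.0 * shared, rng.normal(size=(50, 16))]  # toy "text" layers
i, j, scores = select_layer_pair(layers_a, layers_b, k=5)
print(i, j)
```

In this toy setup only the second layer of the first model and the first layer of the second model share neighborhood structure, so that pair receives the highest score. With real encoders, each entry of `layers_a` and `layers_b` would hold features of the same paired samples extracted at a different depth.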

Theorems & Definitions (5)

  • Lemma 1: Generalization bound
  • Lemma 2: Per-sample sensitivity
  • proof
  • Lemma 3: Generalization bound
  • proof