Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Jiancheng Zhang, Yinglun Zhu

TL;DR

This work introduces the first framework for multimodal active learning with unaligned data, in which the learner actively acquires cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines such as CLIP and SigLIP.

Abstract

Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for multimodal active learning with unaligned data, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines such as CLIP and SigLIP, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to $40\%$ without loss in accuracy.

Paper Structure

This paper contains 41 sections, 2 theorems, 1 equation, 8 figures, 6 tables, 4 algorithms.

Key Result

Proposition 1

The per-round data acquisition complexity of the proposed multimodal active learning algorithm is upper bounded by $O(B_C \cdot \lvert\mathcal{D}_t\rvert \cdot \lvert\mathcal{S}_{t-1}\rvert)$, resulting in an overall complexity of $O(T^2 \cdot B \cdot B_C \cdot \lvert\mathcal{D}\rvert)$.
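
The overall bound follows from summing the per-round bound over all rounds. The step below is a sketch under assumptions that this summary does not state explicitly: $T$ is the number of rounds, $B$ is the per-round annotation budget so that $\lvert\mathcal{S}_{t-1}\rvert \le (t-1)B$, and the candidate set satisfies $\mathcal{D}_t \subseteq \mathcal{D}$.

$$
\sum_{t=1}^{T} B_C \cdot \lvert\mathcal{D}_t\rvert \cdot \lvert\mathcal{S}_{t-1}\rvert
\;\le\; B_C \cdot \lvert\mathcal{D}\rvert \cdot B \sum_{t=1}^{T} (t-1)
\;=\; B_C \cdot \lvert\mathcal{D}\rvert \cdot B \cdot \frac{T(T-1)}{2}
\;=\; O(T^2 \cdot B \cdot B_C \cdot \lvert\mathcal{D}\rvert).
$$

Under the same assumptions, with $B_C$ and $\lvert\mathcal{S}_{t-1}\rvert$ held fixed the per-round cost grows linearly in $\lvert\mathcal{D}_t\rvert$, which is one way to read the linear-time acquisition claim in the abstract.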

Figures (8)

  • Figure 1: Results of pool-based multimodal active learning on the ColorSwap dataset with CLIP-B32 (top) and SigLIP-B16 (bottom). We report text score (left), image score (middle), and group score (right) as learning progresses.
  • Figure 2: Streaming-based multimodal active learning with the MS-COCO (left and middle) and DataComp (right) datasets using CLIP-B32. We report R@1 (image-to-text) (left), R@1 (text-to-image) (middle), and the average score across 38 downstream tasks (right). We report algorithm performance as learning progresses.
  • Figure 3: Group scores on the ColorSwap dataset in the pool-based setting, using CLIP-L14 (left), LiT-L14 (middle), and SigLIP-L16 (right).
  • Figure 4: Parameter study of our algorithm with different values of $B_C$ in the pool-based setting using CLIP-B32 and SigLIP-L16. We report group scores as learning progresses.
  • Figure 5: t-SNE visualization of image-modality embeddings from the ColorSwap dataset, comparing our method with an uncertainty-based baseline (Uncertainty). Points selected by the baseline are shown as blue circles, while points selected by our method are shown as red stars. Blue density contours represent the distribution of all data. Compared to the uncertainty-based method, our approach selects samples that not only capture uncertain regions but also exhibit greater diversity across the embedding space.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 1
  • proof
  • Remark 1