Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning

Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See

TL;DR

This work tackles cold-start multimodal active learning by introducing MMCSAL, a two-stage framework that (i) bridges modality gaps in multimodal self-supervised learning using uni-modal prototypes and cross-modal contrastive learning, and (ii) enhances data selection by enforcing cross-modal alignment within the chosen subset. Stage 1 trains with a joint objective that combines cross-modal and uni-modal losses; Stage 2 uses a selection objective with distribution, diversity, and alignment terms to choose multimodal data pairs for labeling. Across Food101, KineticsSound, and VGGSound, MMCSAL-final outperforms warm-start and several cold-start baselines, with ablations showing the complementary benefits of prototypes and alignment regularization. The approach reduces labeling costs while preserving or improving downstream multimodal performance, and provides practical guidance on hyperparameters such as the alignment strength $\lambda_{\text{align}}$.

Abstract

Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only cross-modal pairing information is used as the self-supervision signal. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios where alignment between modalities is overlooked. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of the multimodal data pairs selected in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.
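
The abstract describes the modality gap as a distance between the centroids of representations from different modalities. As a rough illustration only, the sketch below measures that quantity with NumPy on synthetic features. The L2 normalization, the Euclidean distance, the function names, and the random "audio"/"video" features are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length (assumed preprocessing, not from the paper).
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def modality_gap(feats_a, feats_b):
    # Euclidean distance between the centroids of two normalized feature sets.
    centroid_a = l2_normalize(feats_a).mean(axis=0)
    centroid_b = l2_normalize(feats_b).mean(axis=0)
    return float(np.linalg.norm(centroid_a - centroid_b))

# Synthetic stand-ins for per-sample embeddings of two modalities.
rng = np.random.default_rng(0)
audio = rng.normal(size=(500, 16)) + 2.0   # hypothetical audio features
video = rng.normal(size=(500, 16)) - 2.0   # hypothetical video features

gap_shifted = modality_gap(audio, video)        # modalities in separate regions
gap_same = modality_gap(audio, audio.copy())    # identical features, no gap

print(f"gap between shifted modalities: {gap_shifted:.3f}")
print(f"gap between identical features: {gap_same:.3f}")
```

A large value signals that the two modalities occupy separate regions of the embedding space, which is the situation in which cross-modal distances stop being comparable to uni-modal ones during selection.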

Paper Structure

This paper contains 27 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: MMCSAL framework (using video-audio data as an example): In Stage 1, multimodal self-supervised learning (MMSSL) is applied to a large unlabeled dataset to derive concatenated multimodal representations from unimodal features. In Stage 2, a data selection strategy samples informative multimodal data pairs, which are then annotated by human oracles. The labeled samples are subsequently used to train a downstream task model.
  • Figure 2: Our method (using audio/video as an example): In Stage 1, we employ a uni-modal prototypical loss and a cross-modal contrastive loss. In Stage 2, our selection reduces the distribution gap while maintaining diversity and modality alignment.
  • Figure 3: Preference for data selection of different AL strategies with 5% labeling budget on Food101.
  • Figure 4: Preference for data selection of different AL strategies with 5% labeling budget on KineticsSound.
  • Figure 5: Preference for data selection of different AL strategies with 5% labeling budget on VGGSound.