Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning
Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
TL;DR
This work tackles cold-start multimodal active learning by introducing MMCSAL, a two-stage framework that (i) bridges modality gaps in multimodal self-supervised learning using uni-modal prototypes and cross-modal contrastive learning, and (ii) enhances data selection by enforcing cross-modal alignment within the chosen subset. The method uses a joint objective that combines cross-modal and uni-modal losses, and a principled selection objective with distribution, diversity, and alignment terms, yielding multimodal data pairs for labeling. Across Food101, KineticsSound, and VGGSound, MMCSAL-final outperforms warm-start and several cold-start baselines, with ablations showing the complementary benefits of prototypes and alignment regularization. The approach reduces labeling costs while preserving or improving downstream multimodal performance, and provides practical guidance on hyperparameters such as $\lambda_{\text{align}}$ for alignment strength.
Abstract
Training multimodal models requires a large amount of labeled data. Active learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start approaches, which rely on sufficient labeled data to train a well-calibrated model that can assess the uncertainty and diversity of unlabeled data. However, when assembling a dataset, labeled data are often scarce initially, leading to a cold-start problem. Additionally, AL methods seldom address multimodal data, highlighting a research gap in this field. Our research addresses these issues by developing a two-stage method for Multi-Modal Cold-Start Active Learning (MMCSAL). Firstly, we observe the modality gap, a significant distance between the centroids of representations from different modalities, when only cross-modal pairing information is used as the self-supervision signal. This modality gap affects the data selection process, as we calculate both uni-modal and cross-modal distances. To address this, we introduce uni-modal prototypes to bridge the modality gap. Secondly, conventional AL methods often falter in multimodal scenarios because they overlook alignment between modalities. Therefore, we propose enhancing cross-modal alignment through regularization, thereby improving the quality of selected multimodal data pairs in AL. Finally, our experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs across three multimodal datasets.
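The modality gap described in the abstract, the distance between the centroids of representations from different modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `modality_gap`, the L2 normalization of embeddings, the use of Euclidean distance, and the synthetic `image_emb` and `audio_emb` arrays are all assumptions made for illustration.

```python
import numpy as np


def modality_gap(emb_a, emb_b):
    """Distance between the centroids of two sets of embeddings.

    Assumption: embeddings are L2-normalized first and the gap is measured
    with Euclidean distance; the paper may use a different convention.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


# Hypothetical embeddings: two modalities occupying different regions
# of a shared 32-dimensional space (synthetic data, not real features).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(256, 32)) + 2.0
audio_emb = rng.normal(size=(256, 32)) - 2.0

gap_shifted = modality_gap(image_emb, audio_emb)  # large: centroids far apart
gap_same = modality_gap(image_emb, image_emb)     # zero: identical centroids
```

A large value of `gap_shifted` relative to typical within-modality distances suggests that cross-modal distances are not directly comparable to uni-modal ones, which is the issue the uni-modal prototypes are meant to address.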
