MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning

Seong-Hyeon Hwang, Soyoung Choi, Steven Euijong Whang

TL;DR

MIDAS tackles modality imbalance in multimodal learning by treating misaligned samples as informative supervision. It generates misaligned pairs and labels them with a unimodal-confidence-based soft target, then strengthens learning from weaker modalities via a dynamic weak-modality weight and prioritizes harder, more semantically ambiguous misaligned samples through hard-sample weighting. The approach yields consistent improvements over strong baselines across multiple datasets, demonstrating improved modality balance and discriminative power. This data-centric augmentation offers a practical path to robust, balanced multimodal representations with potential applicability beyond classification.

Abstract

Multimodal models often over-rely on dominant modalities, failing to achieve optimal performance. While prior work focuses on modifying training objectives or optimization procedures, data-centric solutions remain underexplored. We propose MIDAS, a novel data augmentation strategy that generates misaligned samples with semantically inconsistent cross-modal information, labeled using unimodal confidence scores to compel learning from contradictory signals. However, this confidence-based labeling can still favor the more confident modality. To address this within our misaligned samples, we introduce weak-modality weighting, which dynamically increases the loss weight of the least confident modality, thereby helping the model fully utilize the weaker modality. Furthermore, when misaligned features are more similar to the aligned features, the misaligned samples are harder to classify, and training on them helps the model better distinguish between classes. To leverage this, we propose hard-sample weighting, which prioritizes such semantically ambiguous misaligned samples. Experiments on multiple multimodal classification benchmarks demonstrate that MIDAS significantly outperforms related baselines in addressing modality imbalance.
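To make the confidence-based labeling idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not give the exact formulas. The helper names (`misaligned_indices`, `confidence_soft_labels`) are hypothetical. The sketch assumes that a misaligned pair combines modality A of one sample with modality B of a sample from a different class, and that the soft target splits probability mass between the two source labels in proportion to each unimodal classifier's confidence in its own label.

```python
import numpy as np


def misaligned_indices(labels, rng):
    """Pick, for each sample i, a partner j with a different label.

    Hypothetical helper: modality A of sample i would be paired with
    modality B of sample j to form a semantically inconsistent input.
    """
    labels = np.asarray(labels)
    partners = np.empty(len(labels), dtype=int)
    for i, y in enumerate(labels):
        candidates = np.flatnonzero(labels != y)
        if candidates.size == 0:
            raise ValueError("batch must contain at least two classes")
        partners[i] = rng.choice(candidates)
    return partners


def confidence_soft_labels(probs_a, probs_b, labels_a, labels_b, num_classes):
    """Blend two one-hot labels by normalized unimodal confidence.

    Assumed rule (not taken from the paper): each modality's confidence is
    the probability its unimodal classifier assigns to its own true label,
    and the two confidences are normalized to sum to 1.
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    rows = np.arange(len(labels_a))

    conf_a = probs_a[rows, labels_a]
    conf_b = probs_b[rows, labels_b]
    total = np.maximum(conf_a + conf_b, 1e-12)  # guard against division by zero

    targets = np.zeros((len(labels_a), num_classes))
    targets[rows, labels_a] += conf_a / total
    targets[rows, labels_b] += conf_b / total
    return targets
```

In this sketch, a pair whose modality-A classifier is more confident receives a target that leans toward the modality-A label. This is the bias the abstract says weak-modality weighting is meant to counteract.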

Paper Structure

This paper contains 43 sections, 12 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Accuracy comparison between Joint training and our method on aligned (original) and misaligned validation data. (b) Comparison of modality confidence scores between Joint training and our method when predicting misaligned validation data on the Kinetics-Sounds dataset.
  • Figure 2: MIDAS trains a multimodal model on both aligned and misaligned samples with conflicting semantics simultaneously. MIDAS consists of three main components: 1) We label misaligned samples with a confidence-based labeling strategy using unimodal classifiers. 2) Weak-modality weighting increases the loss weight of the least confident modality. 3) Hard-sample weighting assigns a higher loss weight to more confusing misaligned samples containing similar semantics.
  • Figure 3: Normalized confidence score comparison between (a) MIDAS without the weak-modality weighting and (b) MIDAS with the weak-modality weighting on the CREMA-D dataset.
  • Figure 4: Trends of weak-modality weight $\alpha$ during training on four datasets.
  • Figure 5: Model confidence curves of Joint training and MIDAS for each modality on (a) CREMA-D (audio, A; video, V) and (b) UCF-101 (optical flow, OF; RGB frame, RF) datasets.
  • ...and 2 more figures