Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise

Zijing Xu, Yunfeng Kou, Kunming Wu, Hong Liu

TL;DR

CAL introduces a contribution-guided, asymmetric learning framework for robust multimodal fusion under imbalance and noise. It jointly exploits a Shapley-inspired modality contribution metric and an asymmetric gradient modulation with a dynamic Softmax-based weighting, plus an asymmetric information bottleneck to compress noise while preserving task-relevant signals. The method achieves state-of-the-art results on five benchmarks and demonstrates strong robustness to various noise attacks, supported by comprehensive ablations and visualizations. CAL offers a modular, transferable approach with practical impact for real-world multimodal systems where modality value differs and data quality varies.

Abstract

Multimodal learning faces two major challenges, modality imbalance and data noise, both of which degrade the robustness and generalization ability of models. Existing methods balance modalities by suppressing the dominant ones, but they neglect the inherent differences in information value between modalities and can therefore converge to suboptimal solutions. This paper proposes a modality compression paradigm, Contribution-Guided Asymmetric Learning (CAL), which strengthens high-contribution modalities while compressing weak modalities so that their contribution increases, allowing both to improve multimodal fusion. CAL is built on a modality contribution metric W^m that combines information quantity I(m) and confidence D(m), and it introduces two mechanisms: asymmetric gradient acceleration and a contribution-aware Asymmetric Information Bottleneck (AIB) for compression. The former accelerates the gradient updates of modalities, while the latter dynamically compresses noise in low-contribution modalities. On five benchmark datasets covering emotion recognition, scene recognition, and event localization, CAL performs strongly in both imbalanced fusion and noise robustness tests. On CREMA-D, KS, and AVE, CAL achieves 79.30%, 74.82%, and 74.21% accuracy, respectively, significantly outperforming the existing state-of-the-art model ARL. In high-noise robustness tests on the MVSA-Single and NYUD2 datasets, CAL also achieves leading performance under various attack strategies. These results support the advantages of CAL under modality imbalance and noise interference. As a flexible and efficient framework, CAL transfers readily to other tasks and is broadly applicable.
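
To make the contribution-guided weighting concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the formulas for I(m), D(m), or W^m, so the code assumes, purely for illustration, that a raw score is the product I(m) · D(m), that W^m is a softmax over those scores, and that each modality's gradient is scaled by 1 + W^m. The function names, the `temperature` parameter, and the example numbers are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contribution_weights(info, conf, temperature=1.0):
    """Assumed form of W^m: softmax over I(m) * D(m) across modalities.

    `info` and `conf` map modality names to illustrative scalar values
    standing in for information quantity I(m) and confidence D(m).
    """
    names = sorted(info)
    raw = [info[m] * conf[m] for m in names]
    return dict(zip(names, softmax(raw, temperature)))


def gradient_scales(weights):
    """Assumed asymmetric modulation: scale gradients by 1 + W^m.

    Every modality is accelerated (scale > 1), and higher-contribution
    modalities are accelerated more.
    """
    return {m: 1.0 + w for m, w in weights.items()}


# Made-up example values for two modalities.
weights = contribution_weights(
    info={"audio": 0.9, "visual": 0.4},
    conf={"audio": 0.8, "visual": 0.5},
)
scales = gradient_scales(weights)
```

In a training loop, factors like `scales` would multiply each modality encoder's gradients before the optimizer step; how CAL actually does this is defined in the paper's method section, not here.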

Paper Structure

This paper contains 20 sections, 15 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The CAL architecture.
  • Figure 2: Left: Accuracy-versus-epoch curves for the four gradient modulation strategies (Strong, None, OGM, Weak). Right: Comparison between the Strong strategy and the ARL strategy for the audio and visual modalities.
  • Figure 3: t-SNE visualization of feature distributions for three model configurations on the CREMA-D dataset. (a) Untrained model. (b) Baseline model without the AIB loss. (c) Model with the AIB loss. The ellipses represent the 95% confidence regions for each feature type. All visualizations share a unified coordinate system for direct comparison.
  • Figure 4: Left: Relationship between single-modality accuracy and its contribution. Right: Relationship between fusion accuracy and modality contribution. The shaded area indicates the 95% confidence interval.