Unbiased Dynamic Multimodal Fusion

Shicai Wei; Kaijie Zhang; Luyi Chen; Tao He; Guiduo Duan

Abstract

Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Dynamic multimodal methods have therefore been proposed to assess modality quality and adjust each modality's contribution accordingly. However, they typically rely on empirical metrics and fail to measure modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, a hard-to-learn modality is doubly penalized, and dynamic fusion can perform worse than static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality features. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measurement across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.
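The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that), and the abstract does not give the actual formulas. Every function name, the Gaussian noise model, the `max_sigma` range, the accuracy-drop definition of dependency, and the `exp(-sigma) / dependency` weighting rule are assumptions chosen only for illustration.

```python
import numpy as np


def add_controlled_noise(x, rng, max_sigma=5.0):
    """Corrupt each sample with Gaussian noise of a known, sampled intensity.

    Assumption: noise is additive Gaussian with sigma drawn uniformly from
    [0, max_sigma]; the sampled sigma becomes the regression target.
    """
    sigma = rng.uniform(0.0, max_sigma, size=(x.shape[0], 1))
    noisy = x + sigma * rng.standard_normal(x.shape)
    return noisy, sigma[:, 0]


def noise_regression_loss(predicted_sigma, true_sigma):
    """Mean squared error between predicted and injected noise levels."""
    diff = np.asarray(predicted_sigma) - np.asarray(true_sigma)
    return float(np.mean(diff ** 2))


def dependency_from_dropout(acc_full, acc_without):
    """Estimate how much the network relies on each modality.

    Assumption: reliance is the accuracy lost when that modality is dropped,
    normalized so the values sum to 1.
    """
    drops = np.maximum(acc_full - np.asarray(acc_without, dtype=float), 1e-8)
    return drops / drops.sum()


def fusion_weights(predicted_sigma, dependency):
    """Combine estimated quality with a correction for reliance bias.

    Assumption (hypothetical rule): quality decays as exp(-sigma), and
    dividing by the dependency offsets the network's built-in preference
    for the easier modality.
    """
    quality = np.exp(-np.asarray(predicted_sigma, dtype=float))
    scores = quality / np.asarray(dependency, dtype=float)
    return scores / scores.sum()


rng = np.random.default_rng(0)
features = np.zeros((4, 3))
noisy, sigma = add_controlled_noise(features, rng)

# Two modalities: dropping the first costs far more accuracy than the second.
dependency = dependency_from_dropout(0.7, [0.3, 0.6])
weights = fusion_weights([1.0, 1.0], dependency)
```

In this toy setting both modalities have the same estimated noise level, so the weights differ only through the dependency correction: the modality the network relies on less receives the larger weight rather than being suppressed a second time.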

Paper Structure (16 sections, 13 equations, 3 figures, 5 tables)

Introduction
Related Work
Dynamic Multimodal Learning
Imbalanced Multimodal Learning
Methods
Re-analyze the Dynamic Multimodal Learning
Unbiased Dynamic Multimodal Learning
Noise-aware Uncertainty Estimator
Modality-dependency Calculator
Progressive Optimization Strategy
Experiments
Experimental Settings
Experimental Results
Ablation Study
Conclusion
...and 1 more sections

Figures (3)

Figure 1: Visualization of dynamic multimodal methods for audio-visual classification on the CREMA-D dataset. (a) Visual weighting coefficients obtained with different uncertainty estimation methods, including energy score (ES) qmf, probabilistic embedding (PE) eau, and the proposed UDML, as varying levels of noise ($\sigma$) are injected into the visual modality. (b) Performance comparison of different methods under static and dynamic weighting when noise ($\sigma=5$) is injected into the visual modality.
Figure 2: The framework of the unbiased dynamic multimodal fusion. It consists of two parts: 1) noise-aware uncertainty estimator, which measures the modality quality; 2) modality-dependency calculator, which quantifies the model's dependency on each modality.
Figure 3: Visualization of dynamic multimodal fusion for audio-visual classification on the CREMA-D dataset as varying levels of noise ($\sigma$) are injected into the visual modality.
