Unbiased Dynamic Multimodal Fusion

Shicai Wei, Kaijie Zhang, Luyi Chen, Tao He, Guiduo Duan

Abstract

Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Dynamic multimodal methods have therefore been proposed to assess modality quality and adjust each modality's contribution accordingly. However, they typically rely on empirical metrics, which fail to measure modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the hard-to-learn modality is doubly penalized, and the performance of dynamic fusion can be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measurement across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.
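
The noise-aware uncertainty estimator described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sine-wave "modality", the hand-made `encode` function (standing in for a learned modality encoder), the linear regression head, and all hyperparameters are assumptions chosen so the example runs with the Python standard library alone. It shows only the core idea: inject noise of a known intensity, then train a head to predict that intensity from the feature.

```python
import math
import random


def make_clean(rng, dim=32):
    """Hypothetical clean modality sample: one sine period with a random phase."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(2.0 * math.pi * i / dim + phase) for i in range(dim)]


def add_noise(x, sigma, rng):
    """Inject zero-mean Gaussian noise with a known intensity sigma."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def encode(x):
    """Stand-in encoder (an assumption): two simple roughness features."""
    diffs = [abs(b - a) for a, b in zip(x, x[1:])]
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [sum(diffs) / len(diffs), std]


def train_noise_head(steps=5000, lr=0.02, max_sigma=2.0, seed=0):
    """Fit a linear head that regresses the injected sigma from the feature."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        sigma = rng.uniform(0.0, max_sigma)           # controlled noise level
        feat = encode(add_noise(make_clean(rng), sigma, rng))
        pred = sum(wi * fi for wi, fi in zip(w, feat)) + b
        err = pred - sigma                            # gradient of 0.5 * err**2
        w = [wi - lr * err * fi for wi, fi in zip(w, feat)]
        b -= lr * err
    return w, b


def predict_sigma(head, x):
    """Predicted noise intensity, usable as an uncertainty score for x."""
    w, b = head
    return sum(wi * fi for wi, fi in zip(w, encode(x))) + b
```

In this toy setting, the predicted intensity serves as the uncertainty score: a heavily corrupted input should receive a larger prediction than a nearly clean one, and a fusion module could down-weight the modality accordingly.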

Paper Structure (16 sections, 13 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Visualization of dynamic multimodal methods for audio-visual classification on the CREMA-D dataset. (a) Visual weighting coefficients obtained with different uncertainty estimation methods, namely energy score (ES) [qmf], probabilistic embedding (PE) [eau], and the proposed UDML, as varying levels of noise ($\sigma$) are injected into the visual modality. (b) Performance comparison of different methods under static and dynamic weighting when noise ($\sigma=5$) is injected into the visual modality.
  • Figure 2: The framework of the unbiased dynamic multimodal fusion. It consists of two parts: 1) noise-aware uncertainty estimator, which measures the modality quality; 2) modality-dependency calculator, which quantifies the model's dependency on each modality.
  • Figure 3: Visualization of dynamic multimodal fusion for audio-visual classification on the CREMA-D dataset as varying levels of noise ($\sigma$) are injected into the visual modality.
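
The second component in Figure 2, the modality-dependency calculator, can be sketched in the same hedged spirit. This section does not give the formula UDML uses, so everything below is an assumption for illustration: the toy `fused_score` network, the dependency measure (mean output change when one modality is zeroed), and the inverse-dependency scaling in `debiased_weights` are one plausible reading of "quantify the reliance bias via modality dropout and incorporate it into the weighting mechanism", not the paper's definition.

```python
import math


def fused_score(feats, reliance):
    """Toy fusion network: a reliance-weighted sum of per-modality scores."""
    return sum(reliance[m] * feats[m] for m in feats)


def modality_dependency(samples, reliance):
    """Estimate dependency by dropping (zeroing) one modality at a time.

    Returns each modality's share of the total output change, so the
    values sum to 1.
    """
    drop = {m: 0.0 for m in reliance}
    for feats in samples:
        full = fused_score(feats, reliance)
        for m in reliance:
            ablated = dict(feats)
            ablated[m] = 0.0                      # modality dropout
            drop[m] += abs(full - fused_score(ablated, reliance))
    total = sum(drop.values())
    return {m: drop[m] / total for m in drop}


def debiased_weights(uncertainty, dependency, eps=1e-8):
    """Assumed rule: quality exp(-u) scaled by inverse dependency, normalized."""
    raw = {m: math.exp(-uncertainty[m]) / (dependency[m] + eps)
           for m in uncertainty}
    z = sum(raw.values())
    return {m: raw[m] / z for m in raw}


# Hypothetical network that leans heavily on audio.
samples = [{"audio": 1.0, "visual": 1.0}, {"audio": -0.5, "visual": 0.8}]
dep = modality_dependency(samples, {"audio": 0.8, "visual": 0.2})
weights = debiased_weights({"audio": 0.5, "visual": 0.5}, dep)
```

With equal uncertainty on both modalities, the under-used visual modality receives the larger weight in this sketch, which counteracts the network's built-in audio preference instead of suppressing the visual branch a second time.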