Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Siqi Lu, Wanying Xu, Yongbin Zheng, Wenting Luan, Peng Sun, Jianhang Yao

TL;DR

This work introduces a Frequency Ratio Metric (FRM) that quantifies modality preference by analyzing features in the frequency domain, and proposes a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm.

Abstract

Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, in which the model develops an implicit preference for certain modalities and under-optimizes the others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) that quantifies modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. The benefit is not limited to the base model: MWAM also further improves state-of-the-art methods that address the missing-modality problem.
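The abstract does not give the formula for FRM, so the following is only a minimal sketch of the general idea of measuring a feature map in the frequency domain. It computes a ratio of high-frequency to low-frequency spectral energy for a single 2D array. The function name `frequency_energy_ratio`, the `window` parameter, and the square low-frequency region are all assumptions made for illustration; they are not the paper's definition of FRM.

```python
import numpy as np


def frequency_energy_ratio(feature_map, window):
    """Ratio of high- to low-frequency spectral energy of a 2D array.

    Illustrative only: `window` (hypothetical) is the side length of the
    centered square in the shifted spectrum that is treated as low frequency.
    """
    h, w = feature_map.shape
    # Shift the 2D spectrum so the zero-frequency (DC) term sits at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    energy = np.abs(spectrum) ** 2

    # Boolean mask selecting the centered low-frequency square.
    cy, cx = h // 2, w // 2
    half = window // 2
    low_mask = np.zeros((h, w), dtype=bool)
    low_mask[max(cy - half, 0):cy + half + 1,
             max(cx - half, 0):cx + half + 1] = True

    low = energy[low_mask].sum()
    high = energy[~low_mask].sum()
    # Small constant avoids division by zero for purely high-frequency inputs.
    return high / (low + 1e-12)


# A constant map has all of its energy at DC, so the ratio is close to zero;
# a checkerboard has its energy at the highest frequency, so the ratio is large.
flat = np.ones((8, 8))
rows, cols = np.indices((8, 8))
checker = (-1.0) ** (rows + cols)
ratio_flat = frequency_energy_ratio(flat, window=3)
ratio_checker = frequency_energy_ratio(checker, window=3)
```

Under these assumptions, comparing such a ratio across modality branches would be one way to see which branch carries more high-frequency content; how the paper turns its own metric into branch weights is not stated in the abstract.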

Paper Structure (37 sections, 28 equations, 10 figures, 22 tables, 1 algorithm)

Figures (10)

  • Figure 1: Impact of different frequency components on multimodal model performance. (a): Training loss curves. (b): Validation loss curves. (c): Validation accuracy curves. The numbers in the legend represent the window sizes of the filters, and "Raw dataset" denotes the original dataset without any filtering applied. More experimental details can be found in Appendix \ref{sec:A.5}.
  • Figure 2: Architecture and application of our proposed MWAM. (a): Main structure of the MWAM. (b): FRM bank, designed to handle modality exceptions. Its update mechanism is governed by Eq. \ref{e1}. (c): An illustration of the integration of MWAM into a multimodal host model. The calculation rules of FRM follow Eq. \ref{e3}, which requires flipping and aligning the high-frequency components.
  • Figure 3: Schematic of training intervention mechanisms. (a) is parameter-free, and (b) requires extra lightweight auxiliary heads.
  • Figure 4: Training losses for different interventions
  • Figure 5: Visual comparison of filter effects. Along the horizontal axis, the figure is divided into five parts: the spatial- and frequency-domain images of the input image, two high-pass filter results with different window sizes, and two low-pass filter results with different window sizes.
  • ...and 5 more figures