ReconBoost: Boosting Can Achieve Modality Reconcilement

Cong Hua, Qianqian Xu, Shilong Bao, Zhiyong Yang, Qingming Huang

TL;DR

The paper tackles modality competition in multi-modal learning by introducing ReconBoost, an alternating modality-update framework that employs a KL-divergence-based reconcilement regularization ($\mathbb{D}_{KL}$) to balance exploiting uni-modal features with cross-modal interactions. The method yields a gradient-boosting-like mechanism, preserving only the latest per-modality learners and incorporating memory consolidation and a global rectification scheme to stabilize training. The authors prove that, with a KL-based regularizer, ReconBoost is equivalent to an alternating form of gradient boosting and demonstrate substantial empirical gains across six public multi-modal benchmarks, including retrieval tasks, while showing robustness to noise. The work enhances both intra-modality feature exploitation and cross-modal synergy, offering a reproducible framework for strengthening multi-modal representations in diverse domains.
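
As standard background for why a KL-based objective connects to gradient boosting (a textbook identity, not a restatement of the paper's Theorem 3.1), consider a one-hot label $y$ and a prediction $p = \mathrm{softmax}(F)$ over ensemble logits $F$. The symbols $y$, $p$, $F$, and the class index $k$ are notation assumed here for illustration.

$$
\mathbb{D}_{KL}(y \,\|\, p) = \sum_k y_k \log \frac{y_k}{p_k} = -\sum_k y_k \log p_k, \qquad \frac{\partial\, \mathbb{D}_{KL}(y \,\|\, p)}{\partial F_k} = p_k - y_k .
$$

The second equality uses the convention $0 \log 0 = 0$, so the entropy of a one-hot label vanishes and the KL divergence reduces to cross-entropy. The negative gradient $y - p$ is the residual that a gradient-boosting step fits with its next learner.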

Abstract

This paper explores a novel multi-modal alternating learning paradigm pursuing a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost to update a fixed modality each time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference with the classic GB is that we only preserve the newest model for each modality to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments over six multi-modal benchmarks speak to the efficacy of the method. We release the code at https://github.com/huacong/ReconBoost.
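
The alternating update described above can be illustrated with a minimal sketch. It is a simplified stand-in, not the authors' implementation: it assumes linear per-modality learners, a plain cross-entropy loss on summed logits in place of the paper's reconcilement-regularized objective, and it omits memory consolidation and global rectification. The names `make_toy_data` and `alternating_boost`, the synthetic data, and all hyperparameters are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, y_onehot):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    p = softmax(logits)
    return float(-np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1)))

def make_toy_data(seed=0, n=200):
    """Hypothetical two-modality data: a strong and a weak view of one binary label."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    sign = (2 * labels - 1)[:, None]                # -1 / +1 class signal
    strong = rng.normal(size=(n, 3)) + 1.5 * sign   # dominant modality
    weak = rng.normal(size=(n, 3)) + 0.3 * sign     # weak modality
    return [strong, weak], np.eye(2)[labels]

def alternating_boost(features, y_onehot, rounds=20, inner_steps=10, lr=0.1):
    """Update one modality per round while the other learners stay frozen.

    Only the newest weight matrix per modality is kept (no growing ensemble).
    """
    n, n_classes = y_onehot.shape
    weights = [np.zeros((x.shape[1], n_classes)) for x in features]
    history = []
    for s in range(rounds):
        m = s % len(features)  # modality selected for this round
        # Logits contributed by the frozen learners of the other modalities.
        frozen = sum(x @ w for i, (x, w) in enumerate(zip(features, weights)) if i != m)
        for _ in range(inner_steps):
            logits = frozen + features[m] @ weights[m]
            # Negative gradient of cross-entropy w.r.t. the logits: the
            # residual that the selected learner is pushed to correct.
            residual = y_onehot - softmax(logits)
            weights[m] += lr * features[m].T @ residual / n
        history.append(cross_entropy(frozen + features[m] @ weights[m], y_onehot))
    return weights, history

features, y_onehot = make_toy_data()
weights, history = alternating_boost(features, y_onehot)
print(f"loss after round 1: {history[0]:.3f}, after round {len(history)}: {history[-1]:.3f}")
```

Because each round changes only one learner, the selected modality is trained against the errors left by the others rather than competing with them inside a single joint gradient step, which is the intuition the abstract attributes to the boosting view.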

Paper Structure

This paper contains 37 sections, 2 theorems, 37 equations, 12 figures, 16 tables, 1 algorithm.

Key Result

Theorem 3.1

When the reconcilement regularization satisfies [...], it leads to equivalent optimization goals:

Figures (12)

  • Figure 1: The performance among multi-modal learning competitors on the CREMA-D dataset. For audio modality and visual modality, we evaluate the encoders of different competitors by training linear classifiers on them. Uni represents the uni-modal training method.
  • Figure 2: The phenomenon of modality competition is observed in the concatenation fusion method when applied to two datasets: CREMA-D with two modalities and MOSEI with three modalities. In CREMA-D, the learning process is primarily influenced by the audio modality, leading to insufficient learning of the visual modality. In MOSEI, the text modality takes control of multi-modal learning, causing challenges in updating the parameters of both the audio and visual modalities.
  • Figure 3: The overview of proposed ReconBoost. In round $s$, we pick up a specific modality learner to update and keep the others fixed. The updated modality learner can correct errors and enhance the overall performance.
  • Figure 4: The visualization of the modality-specific features among different competitors on the CREMA-D dataset, using t-SNE.
  • Figure 5: Quantitative analysis of modality competition. (a) Modality imbalance ratio (MIR) for all competitors on the AVE dataset. (b) The correlation between the DMC in the concatenation fusion method and the improvement of our method is consistent across all datasets.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Theorem 3.1
  • Corollary 3.2
  • proof
  • proof