Predictive Dynamic Fusion

Bing Cao; Yinan Xia; Yi Ding; Changqing Zhang; Qinghua Hu

TL;DR

We propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of generalization error.

Abstract

Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Because multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and are prone to suboptimal solutions, which makes them unreliable and unstable. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We analyze multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of generalization error. Accordingly, we further propose a relative calibration strategy that calibrates the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm the superiority of our method. Our code is available at https://github.com/Yinan-Xia/PDF.

Paper Structure (43 sections, 4 theorems, 31 equations, 7 figures, 10 tables)

Introduction
Related Works
Theory
Basic Setting
Generalization Error Upper Bound
Collaborative Belief
Mono-Confidence
Holo-Confidence
Co-Belief
Method
Experiments
Setup
Questions to be Verified
Results
Generalization Ability
...and 28 more sections

Key Result

Theorem 3.1

(Generalization Error Upper Bound in Multimodal System). Let $\hat{err}(f^m)$ denote the empirical error of the $m$-th modality on $\mathcal{D}_{train}=\{x_i,y_i\}_{i=1}^N$, and let $\mathcal{H}$ be the hypothesis set, i.e., $\mathcal{H}:\mathcal{X}\rightarrow\{-1,+1\}$, where $f\in \mathcal{H}$. $\mathcal

Figures (7)

Figure 1: Our PDF vs. other fusion methods. Starting from the upper bound of generalization error, we predict the Co-Belief for each modality with a theoretical guarantee. The relative calibration calibrates potential uncertainty for more reliable learning. Experiments on different noise levels validate our superiority.
Figure 2: We use confidence predictors to predict the Mono-Confidence of each modality, which is theoretically negatively correlated with the loss of that modality. Taking into account the Mono-Confidence of the other modalities, we further obtain the Holo-Confidence, which is positively correlated with the loss of the other modalities. By combining Mono-Confidence and Holo-Confidence, we obtain the Co-Belief, which is calibrated and used as the fusion weight to reduce the generalization error bound.
Figure 3: We evaluate Mono-Confidence, Co-Belief, and Calibrated Co-Belief as fusion weights on the NYU Depth V2 dataset to determine how well each minimizes the generalization error upper bound. The yellow part of the pie chart in panels (a), (b), and (c) illustrates the Generalization Error Bound Decreasing Proportion (GDP) for each weight form under varying noise levels (0, 5, and 10). As depicted in panel (c), the Calibrated Co-Belief attains the highest GDP, leading to the best generalization. Panel (d) presents the GDP across diverse fusion strategies and noise intensities. More details are given in the appendix.
Figure 4: The blue area in (a) shows the true distribution of $p_{true}$ for the samples in UPMC Food 101, while the red line in (a) is the loss curve as a function of $p_{true}$. In (b), we report the performance of two prediction methods under various noise conditions. Predicting $p_{true}$ yields better performance.
Figure 5: Relative Calibration (RC) can detect noise variations within the current modality as well as in other modalities. The noise ratio denotes the ratio of the noises added to the two modalities, representing the relative exposure of the two modalities to noise. We maintained a fixed noise level for the modality denoted by the blue line in the figure.
...and 2 more figures
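
The pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exact formulas are not given on this page, so the functional forms below are assumptions chosen only to match the stated correlations. Mono-Confidence is taken as `exp(-loss)` (which equals the true-class probability under a cross-entropy loss), Holo-Confidence as the share of the total loss carried by the other modalities, and Co-Belief as their sum. The function names are hypothetical, per-sample losses stand in for the outputs of the learned confidence predictors, and the relative calibration step is omitted.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mono_confidence(losses):
    """Assumed form: exp(-loss), so confidence falls as the modality's own loss rises.

    losses: array of shape (M, N) for M modalities and N samples.
    """
    return np.exp(-losses)

def holo_confidence(losses, eps=1e-12):
    """Assumed form: fraction of the total loss due to the *other* modalities,
    so confidence rises as the other modalities' losses rise."""
    total = losses.sum(axis=0, keepdims=True)
    return (total - losses) / (total + eps)

def co_belief(losses):
    """Assumed combination: sum of Mono- and Holo-Confidence."""
    return mono_confidence(losses) + holo_confidence(losses)

def fuse(logits, losses):
    """Weight each modality's logits by softmax-normalized Co-Belief.

    logits: array of shape (M, N, C); losses: array of shape (M, N).
    Returns the fused logits of shape (N, C) and the weights of shape (M, N).
    """
    weights = softmax(co_belief(losses), axis=0)
    fused = (weights[..., None] * logits).sum(axis=0)
    return fused, weights

# Usage: two modalities, one sample, two classes.
losses = np.array([[0.1], [2.0]])                     # modality 0 is far more reliable
logits = np.array([[[2.0, -1.0]], [[-0.5, 0.5]]])
fused, weights = fuse(logits, losses)
print(weights.ravel())                                # modality 0 receives the larger weight
```

Normalizing with a softmax across modalities is also an assumption here; it simply guarantees that the per-sample weights are positive and sum to one.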

Theorems & Definitions (4)

Theorem 3.1
Corollary 3.2
Corollary 3.3
Proposition 1.1
