Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
Peizheng Guo, Jingyao Wang, Huijie Guo, Jiangmeng Li, Chuxiong Sun, Changwen Zheng, Wenwen Qiang
TL;DR
This work tackles the challenge of gradient fusion in Multi-Modal Learning by explicitly modeling gradient uncertainty across modalities and dimensions. It introduces BOGC-MML, which treats modality gradients as probabilistic objects, derives their distributions via a Laplace/Gaussian framework, and converts gradient precision into Dirichlet evidences that are fused with a reduced Dempster–Shafer rule to produce a calibrated update direction. The approach combines Bayesian posterior estimation, Monte Carlo gradient moment matching, and evidential fusion to achieve uncertainty-aware optimization. Experimental results across audio-visual emotion, action recognition, and medical-imaging benchmarks demonstrate improved accuracy and robustness, including in missing-modality scenarios, illustrating the practical impact of uncertainty-calibrated MML training.
Abstract
Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML to explicitly model the gradient uncertainty and guide the model optimization towards the optimal direction. Specifically, we first model each modality's gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence. This evidence quantifies the confidence of each modality in every gradient dimension. Using these evidences, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
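To make the idea of per-dimension, uncertainty-weighted aggregation concrete, the following is a minimal sketch and not the BOGC-MML algorithm itself. It assumes each modality's gradient distribution is summarized by the mean and variance of a few sampled gradients, takes the evidence to be the precision directly, and substitutes plain inverse-variance weighting for the reduced Dempster-Shafer rule, whose exact form is not given in this section. The function names `gradient_moments` and `fuse_gradients` and the toy values are hypothetical.

```python
from statistics import fmean, pvariance


def gradient_moments(samples):
    """Return per-dimension (means, variances) from a list of gradient samples."""
    dims = list(zip(*samples))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def fuse_gradients(modality_samples, eps=1e-8):
    """Fuse modality gradients per dimension with precision-based weights.

    Assumption: evidence = precision = 1 / (variance + eps). This is
    inverse-variance weighting, used here as a stand-in for the paper's
    reduced Dempster-Shafer fusion.
    """
    moments = [gradient_moments(s) for s in modality_samples]
    n_dims = len(moments[0][0])
    fused = []
    for d in range(n_dims):
        # One evidence value per modality for this gradient dimension.
        evidences = [1.0 / (variances[d] + eps) for _, variances in moments]
        total = sum(evidences)
        # Weighted average of the modality means; weights sum to 1.
        fused.append(
            sum(e / total * means[d] for e, (means, _) in zip(evidences, moments))
        )
    return fused


# Toy example: two modalities, three sampled gradients each, two dimensions.
audio = [[0.9, 0.2], [1.0, -0.4], [1.1, 0.8]]   # stable in dim 0, noisy in dim 1
video = [[-0.5, 0.3], [0.5, 0.3], [1.5, 0.3]]   # noisy in dim 0, stable in dim 1
fused = fuse_gradients([audio, video])
print(fused)
```

In this toy setting the fused direction follows the audio gradient in the first dimension and the video gradient in the second, because each modality is more certain (lower variance) in a different dimension. A fixed-weight average would instead blend both modalities equally in every dimension.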
