Robust Multimodal Learning via Entropy-Gated Contrastive Fusion
Leon Chlon, Maggie Chlon, MarcAntonio M. Awada
TL;DR
This work tackles robust, calibrated multimodal inference under missing modalities by introducing Adaptive Entropy-Gated Contrastive Fusion (AECF), a lightweight fusion layer that operates on top of frozen encoders. It combines three synergistic modules: a meta-adaptive entropy-regularised gate that prevents modality collapse, Contrastive Expert Calibration (CEC), which enforces monotone confidence across all modality subsets, and Adaptive Curriculum Masking (ACM), which adversarially probes dominant modalities during training. The method comes with theoretical guarantees, namely a worst-case subset regret bound and a PAC-like calibration bound, and with empirical gains on AV-MNIST and MS-COCO: masked-input mAP improves by up to +18 pp and ECE is reduced by up to ~200%, with minimal runtime overhead. Because it preserves accuracy on full inputs while improving robustness and calibration under partial observations, AECF is a drop-in fusion layer that offers a practical path to reliable multimodal deployment with frozen backbones.
Abstract
Real-world multimodal systems routinely face missing-input scenarios: a robot loses audio in a factory, or a clinical record omits lab tests at inference time. Standard fusion layers preserve either robustness or calibration, but never both. We introduce Adaptive Entropy-Gated Contrastive Fusion (AECF), a single lightweight layer that (i) adapts its entropy coefficient per instance, (ii) enforces monotone calibration across all modality subsets, and (iii) drives a curriculum mask directly from training-time entropy. On AV-MNIST and MS-COCO, AECF improves masked-input mAP by +18 pp at a 50% drop rate while reducing ECE by up to 200%, yet adds only 1% run-time overhead. All backbones remain frozen, making AECF an easy drop-in layer for robust, calibrated multimodal inference.
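
To make the idea of an entropy-regularised gate over modalities concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the gate is a softmax over per-modality logits, that missing modalities are excluded before normalisation, and that the regulariser is a plain Shannon-entropy bonus with a fixed coefficient `lam` (AECF adapts this coefficient per instance, which is not modelled here). All function and variable names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_entropy(weights):
    # Shannon entropy (in nats) of the gate weights; zero weights contribute nothing.
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def gated_fusion(features, gate_logits, present):
    # features: one equal-length feature vector per modality (e.g. frozen-encoder outputs)
    # gate_logits: one scalar per modality
    # present: list of bools, False marks a missing (masked) modality
    idx = [i for i, p in enumerate(present) if p]
    if not idx:
        raise ValueError("at least one modality must be present")
    # Normalise only over the modalities that are actually available.
    w_present = softmax([gate_logits[i] for i in idx])
    weights = [0.0] * len(features)
    for i, w in zip(idx, w_present):
        weights[i] = w
    dim = len(features[0])
    fused = [sum(weights[m] * features[m][d] for m in range(len(features)))
             for d in range(dim)]
    return fused, weights, gate_entropy(weights)

def regularised_loss(task_loss, entropy, lam):
    # Assumed form: subtracting lam * entropy rewards spread-out gates,
    # which discourages collapse onto a single modality.
    return task_loss - lam * entropy

# Toy usage with two modalities and 2-d features.
image_feat = [1.0, 0.0]
audio_feat = [0.0, 1.0]
fused, weights, h = gated_fusion([image_feat, audio_feat], [0.0, 0.0], [True, True])
fused_masked, w_masked, h_masked = gated_fusion(
    [image_feat, audio_feat], [0.0, 0.0], [True, False])
```

With equal logits and both modalities present, the gate splits its weight evenly and the entropy equals ln 2; when audio is masked, all weight falls on the image features and the entropy is zero. A training loop in this spirit would randomly mask modalities and add the entropy term to the task loss.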
