Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion
Tong Zhang, Shu Shen, C. L. Philip Chen
TL;DR
Multi-Level Adaptive DeConfusion (MLAD) addresses inter-class confusion in multimodal learning under noisy or low-quality data with a two-level deconfusion strategy. Class-Adaptive Deconfusion (CAD) removes global-level confusion by combining dynamic-exit modality encoders, which adapt to how hard each class is to discriminate, with cross-class residual reconstruction. Sample-Adaptive Deconfusion (SAD) then removes sample-specific confusion through cross-modality rectification guided by confusion-free modality priors. MLAD outperforms state-of-the-art methods in accuracy and robustness across multiple benchmarks and noise conditions, and ablations validate each component. Together, CAD and SAD offer a principled path toward more reliable, high-confidence multimodal predictions in real-world conditions.
Abstract
Multimodal learning enhances the performance of various machine learning tasks by leveraging complementary information across different modalities. However, existing methods often learn multimodal representations that retain substantial inter-class confusion, making it difficult to achieve high-confidence predictions, particularly in real-world scenarios with low-quality or noisy data. To address this challenge, we propose Multi-Level Adaptive DeConfusion (MLAD), which eliminates inter-class confusion in multimodal data at both global and sample levels, significantly enhancing the classification reliability of multimodal models. Specifically, MLAD first learns class-wise latent distributions with global-level confusion removed via dynamic-exit modality encoders that adapt to the varying discrimination difficulty of each class and a cross-class residual reconstruction mechanism. Subsequently, MLAD further removes sample-specific confusion through sample-adaptive cross-modality rectification guided by confusion-free modality priors. These priors are constructed from low-confusion modality features, identified by evaluating feature confusion using the learned class-wise latent distributions and selecting those with low confusion via a Gaussian mixture model. Experiments demonstrate that MLAD outperforms state-of-the-art methods across multiple benchmarks and exhibits superior reliability.
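The Gaussian-mixture selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each modality feature has already been assigned a scalar confusion score (how that score is computed from the class-wise latent distributions is not specified here), fits a two-component 1-D Gaussian mixture to the scores with EM, and keeps the features most likely to belong to the lower-mean component. The function names, the 0.5 posterior threshold, and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def _log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def posterior(x, weights, means, variances):
    """Posterior probability of each of the two components given x."""
    logs = [math.log(weights[k]) + _log_gauss(x, means[k], variances[k])
            for k in range(2)]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fit_gmm_1d(xs, n_iter=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    lo, hi = min(xs), max(xs)
    means = [lo, hi]  # initialise the components at the extremes
    init_var = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters
        resp = [posterior(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, min_var)  # floor to avoid collapse
    return weights, means, variances


def select_low_confusion(scores, threshold=0.5):
    """Return indices of scores assigned to the lower-mean component."""
    weights, means, variances = fit_gmm_1d(scores)
    low = 0 if means[0] <= means[1] else 1
    return [i for i, s in enumerate(scores)
            if posterior(s, weights, means, variances)[low] > threshold]


# Hypothetical confusion scores for eight modality features.
scores = [0.05, 0.10, 0.08, 0.12, 0.90, 0.85, 0.95, 0.07]
selected = select_low_confusion(scores)
```

On these toy scores the two groups are well separated, so the features with scores near 0.1 are selected and those near 0.9 are rejected. In practice a library mixture model would likely replace the hand-written EM loop; the pure-Python version is used here only to keep the sketch self-contained.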
