Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion

Tong Zhang, Shu Shen, C. L. Philip Chen

TL;DR

MLAD addresses inter-class confusion in multimodal learning under noisy data with a two-level deconfusion strategy. Class-Adaptive Deconfusion (CAD) removes global confusion using dynamic-exit encoders and residual cross-class reconstruction, while Sample-Adaptive Deconfusion (SAD) removes sample-specific confusion using confusion priors and cross-modality rectification. The authors report higher accuracy and robustness across multiple benchmarks and noise conditions, with ablations validating each component. Together, CAD and SAD aim at more reliable, high-confidence multimodal predictions in real-world conditions.

Abstract

Multimodal learning enhances the performance of various machine learning tasks by leveraging complementary information across different modalities. However, existing methods often learn multimodal representations that retain substantial inter-class confusion, making it difficult to achieve high-confidence predictions, particularly in real-world scenarios with low-quality or noisy data. To address this challenge, we propose Multi-Level Adaptive DeConfusion (MLAD), which eliminates inter-class confusion in multimodal data at both global and sample levels, significantly enhancing the classification reliability of multimodal models. Specifically, MLAD first learns class-wise latent distributions with global-level confusion removed via dynamic-exit modality encoders that adapt to the varying discrimination difficulty of each class and a cross-class residual reconstruction mechanism. Subsequently, MLAD further removes sample-specific confusion through sample-adaptive cross-modality rectification guided by confusion-free modality priors. These priors are constructed from low-confusion modality features, identified by evaluating feature confusion using the learned class-wise latent distributions and selecting those with low confusion via a Gaussian mixture model. Experiments demonstrate that MLAD outperforms state-of-the-art methods across multiple benchmarks and exhibits superior reliability.
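
The prior-construction step above selects low-confusion features with a Gaussian mixture model. The abstract does not give the exact confusion score or selection rule, so the following is only a minimal sketch under stated assumptions: each feature is assumed to already have a scalar confusion score (the `scores` list is made-up illustrative data), a two-component 1-D mixture is fitted with plain EM, and a feature is kept when its posterior under the lower-mean component exceeds a hypothetical `threshold` of 0.5. The function names are illustrative and do not come from the paper.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_component_gmm(scores, iters=100, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    lo, hi = min(scores), max(scores)
    means = [lo, hi]                                   # initialise at the extremes
    init_var = max(((hi - lo) ** 2) / 4.0, var_floor)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    n = len(scores)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([0.5, 0.5] if total == 0.0 else [p[0] / total, p[1] / total])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue                               # skip an empty component
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            variances[k] = max(var, var_floor)         # floor avoids collapse to zero
    return weights, means, variances

def select_low_confusion(scores, threshold=0.5):
    """Return indices whose posterior under the low-mean component exceeds threshold."""
    weights, means, variances = fit_two_component_gmm(scores)
    low = 0 if means[0] <= means[1] else 1             # component with lower mean score
    selected = []
    for i, x in enumerate(scores):
        p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
        total = p[0] + p[1]
        posterior_low = 0.5 if total == 0.0 else p[low] / total
        if posterior_low > threshold:
            selected.append(i)
    return selected

# Illustrative (made-up) confusion scores: lower means less inter-class confusion.
scores = [0.05, 0.10, 0.08, 0.12, 0.90, 0.85, 0.95, 0.07]
low_confusion_indices = select_low_confusion(scores)
```

In this sketch the selected indices would identify the features used to build the priors; the actual method may score confusion and set the cut-off differently.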

Paper Structure

This paper contains 37 sections, 18 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Illustration of the motivation of this work. For illustrative purposes, these figures show results on a toy FOOD101 subset containing four classes.
  • Figure 2: An overview of the proposed MLAD (better viewed in colour). MLAD employs class-adaptive deconfusion and sample-adaptive deconfusion to eliminate global-level confusion and sample-level confusion. Without loss of generality, this figure illustrates the case of two modalities, with blue and orange representing different modalities. The detailed implementation of the Rectification module (gray block) in (b-2) Sample-Adaptive Cross-Modality Rectification is illustrated in Fig. 3.
  • Figure 3: Detailed implementation of Rectification $(m_{0}\leftarrow m_{1})$. Rectification $(m_{1}\leftarrow m_{0})$ is obtained by swapping the input positions of $z_i^{m_{0}}$ and $z_i^{m_{1}}$.
  • Figure 4: Illustration of the differences in inter-class separability across modalities and the corresponding output depths determined by the Dynamic-Exit Modality Encoder for different classes. (a) Certain classes exhibit lower overall similarity to other classes (e.g., Class 1 in the mRNA modality), indicating lower discrimination difficulty, whereas others show higher similarity (e.g., Class 2 in mRNA), indicating higher discrimination difficulty. (b) The Dynamic-Exit Modality Encoder adaptively adjusts the output depth according to the discrimination difficulty of each class, enabling classes with lower difficulty to exit from shallower layers, while more complex classes are processed at deeper layers. For example, the Basal-like class (corresponding to Class 1 in (a)) exits at the shallowest layer, whereas the HER2-enriched class (corresponding to Class 2 in (a)) exits at the deepest layer.
  • Figure 5: t-SNE visualizations of the original modality features and of the representations learned by the encoders without and with RCCR on the CUB dataset.
  • ...and 5 more figures