Multimodal Classification via Total Correlation Maximization

Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu

TL;DR

This paper theoretically analyzes modality competition and proposes a method for multimodal classification that maximizes the total correlation between multimodal features and labels, which alleviates modality competition while capturing inter-modal interactions via feature alignment.

Abstract

Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://github.com/hubaak/TCMax.
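
To make the objective concrete, the sketch below computes total correlation on a toy discrete distribution, under assumptions that are not from the paper: binary stand-ins for an audio feature, a visual feature, and a label, with hypothetical copy probabilities of 0.9 and 0.7. It also checks the standard identity that the total correlation of three variables splits into an inter-modal term $I(z^{(a)}; z^{(v)})$ plus a joint-prediction term $I(z^{(a)}, z^{(v)}; y)$. The paper may organize this decomposition differently.

```python
import itertools
import math

def marginal(joint, axes):
    """Marginalize a dict {(a, v, y): prob} onto the given index positions."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def total_correlation(joint, n_vars=3):
    """TC = sum of single-variable entropies minus the joint entropy."""
    singles = sum(entropy(marginal(joint, (i,))) for i in range(n_vars))
    return singles - entropy(joint)

def mutual_information(joint, left, right):
    """I(left; right) = H(left) + H(right) - H(left, right)."""
    return (entropy(marginal(joint, left))
            + entropy(marginal(joint, right))
            - entropy(marginal(joint, left + right)))

# Hypothetical toy joint over (audio feature, visual feature, label).
# The label is uniform; each feature copies it with probability
# 0.9 (audio) or 0.7 (visual). These numbers are illustrative only.
joint = {}
for a, v, y in itertools.product((0, 1), repeat=3):
    p_a = 0.9 if a == y else 0.1
    p_v = 0.7 if v == y else 0.3
    joint[(a, v, y)] = 0.5 * p_a * p_v

tc = total_correlation(joint)
alignment = mutual_information(joint, (0,), (1,))     # I(z_a; z_v)
joint_term = mutual_information(joint, (0, 1), (2,))  # I(z_a, z_v; y)

print(f"TC = {tc:.4f}, I(a;v) = {alignment:.4f}, I(a,v;y) = {joint_term:.4f}")
```

In this toy case the total correlation exceeds the joint-prediction term by exactly the inter-modal mutual information, which is one way to read the claim that the objective also rewards feature alignment.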

Paper Structure (44 sections, 10 theorems, 42 equations, 4 figures, 6 tables)

Key Result

Theorem 1

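The mutual information between $Z \in \mathcal{Z}$ and $y \in \mathcal{Y}$ admits the following dual representation: $$I(Z; y) = \sup_{T: \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}} \mathbb{E}_{P_{Zy}}[T] - \log \mathbb{E}_{P_Z \otimes P_y}\left[e^{T}\right],$$ where the supremum is taken over all functions $T$ such that the two expectations are finite. As neural networks $T_\theta$ with parameters $\theta \in \Theta$ form a family of functions that is a subset of $\mathcal{Z} \times \mathcal{Y} \to \mathbb{R}$, we have: $$I(Z; y) \geq \sup_{\theta \in \Theta} \mathbb{E}_{P_{Zy}}[T_\theta] - \log \mathbb{E}_{P_Z \otimes P_y}\left[e^{T_\theta}\right].$$ Fo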
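
The dual representation that MINE builds on is the Donsker-Varadhan form, in which a critic $T$ scores pairs $(Z, y)$ and the value $\mathbb{E}_{P_{Zy}}[T] - \log \mathbb{E}_{P_Z \otimes P_y}[e^{T}]$ never exceeds the mutual information. The sketch below checks this numerically on a hypothetical 2x2 joint distribution, using lookup-table critics in place of a neural network $T_\theta$. The distribution, the function names, and the random critics are illustrative assumptions, not taken from the paper or its code.

```python
import math
import random

def marginals(joint):
    """Return (p_z, p_y) for a dict {(z, y): prob}."""
    p_z, p_y = {}, {}
    for (z, y), p in joint.items():
        p_z[z] = p_z.get(z, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_z, p_y

def exact_mi(joint):
    """Mutual information in nats, computed directly from the table."""
    p_z, p_y = marginals(joint)
    return sum(p * math.log(p / (p_z[z] * p_y[y]))
               for (z, y), p in joint.items() if p > 0)

def dv_bound(joint, critic):
    """Donsker-Varadhan value: E_joint[T] - log E_product[exp(T)]."""
    p_z, p_y = marginals(joint)
    e_joint = sum(p * critic(z, y) for (z, y), p in joint.items())
    e_prod = sum(p_z[z] * p_y[y] * math.exp(critic(z, y))
                 for z in p_z for y in p_y)
    return e_joint - math.log(e_prod)

# Hypothetical joint distribution over a binary feature and a binary label.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_z, p_y = marginals(joint)
mi = exact_mi(joint)

def optimal_critic(z, y):
    """Log density ratio, the critic that attains the supremum."""
    return math.log(joint[(z, y)] / (p_z[z] * p_y[y]))

# Arbitrary critics stand in for an untrained or partly trained network.
rng = random.Random(0)
random_bounds = []
for _ in range(100):
    table = {key: rng.uniform(-2.0, 2.0) for key in joint}
    random_bounds.append(dv_bound(joint, lambda z, y, t=table: t[(z, y)]))

print(f"exact MI            = {mi:.4f}")
print(f"bound, optimal T    = {dv_bound(joint, optimal_critic):.4f}")
print(f"best of 100 random T = {max(random_bounds):.4f}")
```

The optimal critic recovers the exact mutual information, while every other critic gives a smaller value. Restricting $T$ to a parametric family therefore yields a lower bound that training can tighten.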

Figures (4)

  • Figure 1: Venn diagram of an extreme case where the audio encoder has already been well fitted. The visual component (blue) only needs to cover $I(y;z^{(v)}|z^{(a)})$ to achieve the training objective ($\mathcal{L}_{joint}\approx0$) and therefore ends up underfitted.
  • Figure 2: An illustration of the relationship between joint learning, unimodal learning, and learning through maximizing the total correlation.
  • Figure 3: Accuracy on different numbers of sampled negative pairs.
  • Figure 4: Train loss and test accuracy of joint learning, unimodal learning, and TCMax on CREMA-D and UCF101 datasets.

Theorems & Definitions (10)

  • Theorem 1: MINE
  • Corollary 1: TCNE
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Corollary 2: Corollary 1 restated, TCNE
  • Theorem 2
  • Proposition 4: Proposition 1 restated
  • Proposition 5: Proposition \ref{prop:TCNE_equality} restated
  • Proposition 6: Proposition \ref{TCMax_equality} restated