Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences

Fan Qian, Jiqing Han, Jianchen Li, Yongjun He, Tieran Zheng, Guibin Zheng

TL;DR

This work tackles the challenge of unaligned multimodal sentiment analysis by introducing MIRD, a framework that learns a single modality-agnostic representation while disentangling modality-specific components through mutual information minimization. By employing CLUB-based estimators and variational surrogates, MIRD reduces both linear and nonlinear dependencies among latent factors and between latent factors and inputs, aided by unlabeled data for robust MI estimation and data-driven structure learning. The approach integrates modality-specific encoders, a cross-modal WSA-BERT-based agnostic encoder, reconstruction losses, and a two-layer regressor to predict sentiment, achieving state-of-the-art results on CMU-MOSI and CMU-MOSEI and demonstrating improved generalization and interpretability. The findings highlight the importance of nonlinear decorrelation and unlabeled data in disentangled multimodal representations, with practical implications for robust, scalable multimodal sentiment analysis.
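To make the CLUB-based estimation mentioned above concrete, the following is a minimal sketch of a CLUB-style upper bound, $\mathbb{E}_{p(x,y)}[\log q(y|x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log q(y|x)]$, on one-dimensional toy data. It is not the MIRD implementation: the function name `club_upper_bound`, the linear-Gaussian form of the variational distribution $q(y|x)$, and the synthetic data are all assumptions made for illustration. In practice $q(y|x)$ would typically be a neural network trained alongside the encoders.

```python
import numpy as np


def club_upper_bound(x, y):
    """CLUB-style upper-bound estimate of I(x; y) for 1-D samples, in nats.

    Assumption (illustrative only): q(y|x) is Gaussian with a linear mean
    a*x + b and a constant variance, both fitted by least squares.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fit the variational approximation q(y|x) = N(a*x + b, var).
    a, b = np.polyfit(x, y, deg=1)
    mu = a * x + b
    var = np.mean((y - mu) ** 2)

    # log q(y_i | x_i) on paired samples; additive constants cancel below.
    positive = -((y - mu) ** 2) / (2.0 * var)
    # log q(y_j | x_i) on all (i, j) pairs, i.e. the product of marginals.
    negative = -((y[None, :] - mu[:, None]) ** 2) / (2.0 * var)

    return float(np.mean(positive) - np.mean(negative))


# Hypothetical toy data: one dependent pair and one independent pair.
rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)
y_dependent = 0.8 * x + 0.6 * rng.standard_normal(n)
y_independent = rng.standard_normal(n)

bound_dependent = club_upper_bound(x, y_dependent)
bound_independent = club_upper_bound(x, y_independent)
```

Under these assumptions the estimate stays close to zero for the independent pair and is clearly positive for the dependent pair, which is the property that makes such a bound usable as a minimization target during training.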

Abstract

The key challenge in unaligned multimodal language sequences lies in effectively integrating information from various modalities to obtain a refined multimodal joint representation. Recently, disentangle-and-fuse methods have achieved promising performance by explicitly learning modality-agnostic and modality-specific representations and then fusing them into a multimodal joint representation. However, these methods often learn a modality-agnostic representation independently for each modality and use orthogonal constraints to reduce the linear correlations between modality-agnostic and modality-specific representations, without eliminating their nonlinear correlations. As a result, the obtained multimodal joint representation usually suffers from information redundancy, leading to overfitting and poor generalization. In this paper, we propose a Mutual Information-based Representations Disentanglement (MIRD) method for unaligned multimodal language sequences, in which a novel disentanglement framework is designed to jointly learn a single modality-agnostic representation. In addition, a mutual information minimization constraint is employed to improve the disentanglement of representations, thereby reducing information redundancy within the multimodal joint representation. Furthermore, the difficulty of estimating mutual information from limited labeled data is mitigated by introducing unlabeled data. The unlabeled data also help to characterize the underlying structure of multimodal data, which further prevents overfitting and improves model performance. Experimental results on several widely used benchmark datasets validate the effectiveness of the proposed approach.
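The distinction between linear and nonlinear correlation drawn in the abstract can be illustrated with a small numerical sketch. It uses hypothetical toy data rather than anything from the paper: `y = x ** 2` is fully determined by `x`, yet the two are almost linearly uncorrelated, so a constraint that only removes linear correlation would not detect the dependence. The helper names `pearson` and `histogram_mi`, and the simple histogram plug-in estimator, are assumptions chosen for brevity.

```python
import numpy as np


def pearson(a, b):
    """Pearson (linear) correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def histogram_mi(a, b, bins=30):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)  # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)  # marginal of b, shape (1, bins)
    mask = p_ab > 0  # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))


rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
y = x ** 2  # nonlinear function of x with near-zero linear correlation

linear_corr = pearson(x, y)
mi_dependent = histogram_mi(x, y)
mi_independent = histogram_mi(x, rng.standard_normal(x.size))
```

Here `linear_corr` comes out close to zero while `mi_dependent` is far larger than `mi_independent`, which is the kind of residual dependence that a mutual information constraint targets and an orthogonality constraint can miss.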

Paper Structure

This paper contains 27 sections, 22 equations, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Illustration of existing multimodal fusion methods and our method. Subfigures (a), (b), and (c) show the direct fusion methods, previous disentangle-and-fuse methods, and our method, respectively. The purple, orange, and blue circles in the middle of subfigures (b) and (c) denote the modality-specific representations learned from the visual, language, and audio modalities, respectively. The red circles denote the modality-agnostic representation. The green circles denote the multimodal joint representation.
  • Figure 2: The framework of Mutual Information-based Representations Disentanglement (MIRD). The arrows indicate the forward computation. The blue shaded box indicates the Mutual Information Minimization (MIM) module.
  • Figure 3: The diagram of the information constraints between the modality-agnostic representations and the original inputs. The purple and blue regions represent the information contained in the private representations $\mathbf{z}^{V}$ and $\mathbf{z}^{A}$ corresponding to visual and audio modalities, respectively. The orange region represents the information contained in the learned representation $\mathbf{z}^{S}$. The region with vertical stripes represents the genuine shared information between visual and audio modalities. With the mutual information minimization constraints between $\mathbf{z}^{S}$ and $\mathbf{X}^{V}$, as well as $\mathbf{z}^{S}$ and $\mathbf{X}^{A}$, the area of the entire orange region gradually shrinks towards the area of the region with vertical stripes, indicating that $\mathbf{z}^{S}$ only contains shared information.
  • Figure 4: Visualization of modality-agnostic and modality-specific representations on the CMU-MOSI test set. Subfigures (a), (b), and (c) show No Constraints (NC), Orthogonal Constraints (OC), and the Mutual Information Minimization (MIM) constraint, respectively. In each subfigure, blue, orange, green, and red points represent language, visual, audio, and shared representations, respectively.
  • Figure 5: Line chart of the mutual information estimates on the CMU-MOSI dataset. The horizontal axis shows epochs, and the vertical axis shows the estimated mutual information. The blue and orange lines correspond to training without and with unlabeled data, respectively.
  • ...and 3 more figures