Triple Disentangled Representation Learning for Multimodal Affective Analysis

Ying Zhou; Xuefeng Liang; Han Chen; Yin Zhao; Xin Chen; Lida Yu

TL;DR

TriDiRA addresses the problem that modality-specific representations in multimodal affective analysis can carry label-irrelevant information. It introduces a triple disentanglement into a modality-invariant representation $r^*$, an effective modality-specific representation $r \cap u$, and an ineffective modality-specific representation $u^*$. A dual-output attention mechanism disentangles these components from each modality, and only $r^*$ and $r \cap u$ are fused for prediction. Training is guided by the losses $L_{task}$, $L_{sim}$, $L_{h}$, $L_{ucorr}$, and $L_{recon}$. Central Moment Discrepancy (CMD) aligns the invariant representations across modalities, while the Hilbert-Schmidt Independence Criterion (HSIC) enforces independence between the disentangled parts to prevent information leaking between them. Experiments on MOSI, MOSEI, UR-FUNNY, and MELD show TriDiRA achieving state-of-the-art (SOTA) performance, and ablations confirm that both the triple disentanglement and the regularizers are critical to the quality of the effective representations used for fusion.
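To make the two regularizers named above concrete, the following is a minimal NumPy sketch of a Central Moment Discrepancy and of a biased empirical HSIC estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the RBF kernel, the bandwidth `sigma`, the number of moments, and the omission of the interval-scaling factors from the original CMD definition are all choices made here for brevity.

```python
import numpy as np


def cmd(x, y, n_moments=5):
    """Simplified Central Moment Discrepancy between two samples (rows = examples).

    Assumption: the interval-scaling factors of the original CMD definition
    are omitted, so this is only a sketch of the idea.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx, cy = x - mx, y - my
    # First term compares the means; later terms compare higher central moments.
    total = np.linalg.norm(mx - my)
    for k in range(2, n_moments + 1):
        total += np.linalg.norm((cx ** k).mean(axis=0) - (cy ** k).mean(axis=0))
    return total


def rbf_kernel(x, sigma=1.0):
    """Gram matrix of an RBF kernel (the kernel choice is an assumption)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k, l = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(200, 1))
    b = rng.normal(size=(200, 1))  # drawn independently of a
    print("CMD(a, a)   =", cmd(a, a))
    print("CMD(a, a+1) =", cmd(a, a + 1.0))
    print("HSIC(a, a)  =", hsic(a, a))
    print("HSIC(a, b)  =", hsic(a, b))
```

In a training loop, a CMD term of this kind would be minimized between the invariant representations of different modalities, and an HSIC term would be minimized between parts that should be independent. Which pairs the paper actually regularizes, and how the terms are weighted, is specified in its Method section rather than here.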

Abstract

Multimodal learning has shown a significant advantage in affective analysis tasks owing to the comprehensive information provided by multiple modalities, particularly their complementary information. Many recent studies therefore focus on disentangling modality-invariant and modality-specific representations from the input data and then fusing them for prediction. However, our study shows that modality-specific representations may contain information that is irrelevant to, or in conflict with, the task, which degrades the effectiveness of the learned multimodal representations. We revisit the disentanglement problem and propose a novel triple disentanglement approach, TriDiRA, which disentangles the modality-invariant, effective modality-specific, and ineffective modality-specific representations from the input data. By fusing only the modality-invariant and effective modality-specific representations, TriDiRA can significantly alleviate the impact of irrelevant and conflicting information across modalities during model training. Extensive experiments on four benchmark datasets demonstrate the effectiveness and generalization of our triple disentanglement, which outperforms SOTA methods.

Paper Structure (31 sections, 12 equations, 10 figures, 8 tables)

Introduction
Related work
Multimodal representation learning
Binary disentangled representation learning
Method
Feature Extraction
Disentanglement module
Task Losses
Similarity Loss
Independence Losses
Reconstruction Loss
Experiments
Datasets
Evaluation metrics
Experimental settings
...and 16 more sections

Figures (10)

Figure 1: (a) A sample containing consistent and conflicting information among different modalities. (b) Binary disentangled subspaces of unimodality, containing modality-invariant ($r^*$) and modality-specific subspaces ($u$). (c) Triple disentangled subspaces, disentangling the modality-invariant ($r^*$), effective modality-specific ($r\cap u$), and ineffective modality-specific ($u^*$) subspaces from the label-relevant and modality-specific subspaces.
Figure 2: The flowchart of TriDiRA, which includes three modules: feature extraction, feature disentanglement, and feature fusion. The feature extraction module includes three unimodal Transformer encoders and a shared Transformer encoder. The disentanglement module (DS) decomposes the unimodal features into modality-invariant representations, as well as effective and ineffective modality-specific representations. The effective representations are then fused using multi-head attention for prediction. Note that the losses $\mathcal{L}_{h}^{intra}$ and $\mathcal{L}_{recon}$ are also applied to the other two modalities.
Figure 3: (a) The architecture of the disentanglement module. (b) The diagram of the dual-output attention module.
Figure 4: The t-SNE visualization of the modality-invariant and modality-specific (effective and ineffective) subspaces of TriDiRA and DMD on the test set of MOSI.
Figure 5: An instance with label 1.6 whose modalities contain conflicting or irrelevant information, together with the performance of the representations disentangled by TriDiRA. Blue denotes positive, brown denotes negative, and green denotes content irrelevant to sentiment prediction.
...and 5 more figures
