Triple Disentangled Representation Learning for Multimodal Affective Analysis
Ying Zhou, Xuefeng Liang, Han Chen, Yin Zhao, Xin Chen, Lida Yu
TL;DR
TriDiRA addresses the problem that modality-specific representations can carry label-irrelevant information in multimodal affective analysis. It introduces a triple disentanglement of each modality into a modality-invariant part $r^*$, an effective modality-specific part $r \cap u$, and an ineffective modality-specific part $u^*$. A dual-output attention mechanism separates these components, and only $r^*$ and $r \cap u$ are fused for prediction; training combines the losses $L_{task}$, $L_{sim}$, $L_{h}$, $L_{ucorr}$, and $L_{recon}$. Central Moment Discrepancy (CMD) aligns the invariant representations across modalities, while the Hilbert-Schmidt Independence Criterion (HSIC) enforces independence between the disentangled parts to prevent information leakage. Experiments on MOSI, MOSEI, UR-FUNNY, and MELD show TriDiRA achieving SOTA performance, and ablations confirm the critical role of both the triple disentanglement and the regularizers in improving the quality of the effective representations used for multimodal fusion.
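To make the two regularizers named above concrete, the following is a minimal NumPy sketch of an empirical Central Moment Discrepancy and of the standard biased HSIC estimator. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rbf_kernel`, `hsic`, `cmd`), the Gaussian kernel, the bandwidth `sigma`, the number of moments `n_moments`, and the range scaling `span` are all choices made here for the example, and the paper's exact formulation may differ.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of x; sigma is an assumed bandwidth."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2.

    Values near zero suggest the two batches are close to independent;
    larger values indicate statistical dependence.
    """
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k = rbf_kernel(x, sigma)
    l = rbf_kernel(y, sigma)
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)


def cmd(x, y, n_moments=5, span=1.0):
    """Empirical Central Moment Discrepancy between two batches.

    Sums the L2 distance between the means and between the central
    moments of order 2..n_moments, each scaled by span**k, where span
    is the assumed width of the interval bounding the features.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    total = np.linalg.norm(mx - my) / span
    cx, cy = x - mx, y - my
    for k in range(2, n_moments + 1):
        diff = (cx ** k).mean(axis=0) - (cy ** k).mean(axis=0)
        total += np.linalg.norm(diff) / span ** k
    return float(total)
```

In a setup like the one described, a CMD-style term would be minimized between the invariant representations of different modalities, and an HSIC-style term would be minimized between the disentangled parts of the same modality. How these terms map onto the named losses is not specified in this section, so that mapping is left open here.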
Abstract
Multimodal learning has exhibited a significant advantage in affective analysis tasks owing to the comprehensive information provided by multiple modalities, particularly their complementary information. Thus, many emerging studies focus on disentangling modality-invariant and modality-specific representations from the input data and then fusing them for prediction. However, our study shows that modality-specific representations may contain information that is irrelevant to, or in conflict with, the task, which degrades the effectiveness of the learned multimodal representations. We revisit the disentanglement issue and propose a novel triple disentanglement approach, TriDiRA, which disentangles modality-invariant, effective modality-specific, and ineffective modality-specific representations from the input data. By fusing only the modality-invariant and effective modality-specific representations, TriDiRA can significantly alleviate the impact of irrelevant and conflicting information across modalities during model training. Extensive experiments on four benchmark datasets demonstrate the effectiveness and generalizability of our triple disentanglement, with TriDiRA outperforming SOTA methods.
