Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation
Xinyi Che, Wenbo Wang, Yuanbo Hou, Mingjie Xie, Qijun Zhao, Jian Guan
TL;DR
This work tackles Multimodal Emotion Recognition in Conversation (MERC), where existing methods overemphasize cross-modal shared features and neglect modality-specific cues. It introduces AO-FL, a framework that combines Adaptive Angle Optimization (AAO) and Orthogonal Projection Refinement (OPR) to partially disentangle shared and modality-specific features within each modality while maintaining cross-modal semantic alignment. The method first enhances cross-modal consistency and explores the angular relationship between feature types within each modality, then applies feature refinement and contextual enhancement. It achieves state-of-the-art results on IEMOCAP and MELD and generalizes to broader multimodal emotion recognition (MER) tasks when paired with diverse unimodal extractors. Overall, AO-FL offers a flexible, generalizable approach to fusing multimodal cues that balances similarity and complementarity, improving the robustness and interpretability of emotion recognition systems.
Abstract
Multimodal Emotion Recognition in Conversation (MERC) aims to enhance emotion understanding by integrating complementary cues from text, audio, and visual modalities. Existing MERC approaches predominantly focus on cross-modal shared features, often overlooking modality-specific features that capture subtle yet critical emotional cues such as micro-expressions, prosodic variations, and sarcasm. Although related work in multimodal emotion recognition (MER) has explored disentangling shared and modality-specific features, these methods typically employ rigid orthogonal constraints to achieve full disentanglement, which neglects the inherent complementarity between feature types and may limit recognition performance. To address these challenges, we propose Angle-Optimized Feature Learning (AO-FL), a framework tailored for MERC that achieves partial disentanglement of shared and specific features within each modality through adaptive angular optimization. Specifically, AO-FL aligns shared features across modalities to ensure semantic consistency, and within each modality it adaptively models the angular relationship between its shared and modality-specific features to preserve both distinctiveness and complementarity. An orthogonal projection refinement further removes redundancy in specific features and enriches shared features with contextual information, yielding more discriminative multimodal representations. Extensive experiments confirm the effectiveness of AO-FL for MERC, demonstrating superior performance over state-of-the-art approaches. Moreover, AO-FL can be seamlessly integrated with various unimodal feature extractors and extended to other multimodal fusion tasks, such as MER, thereby highlighting its strong generalization beyond MERC.
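To make the geometric idea in the abstract concrete, the following is a minimal NumPy sketch of the two quantities it refers to: the angle between a shared and a modality-specific feature vector, and an orthogonal projection that removes from the specific feature the component it has in common with the shared one. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the AO-FL losses or network details, so the function names (`angle_deg`, `remove_shared_component`) and the toy 3-dimensional vectors are hypothetical.

```python
import numpy as np


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def angle_deg(u, v):
    """Angle between two feature vectors, in degrees."""
    return float(np.degrees(np.arccos(np.clip(cosine(u, v), -1.0, 1.0))))


def remove_shared_component(specific, shared, eps=1e-8):
    """Project `specific` onto the orthogonal complement of `shared`.

    Generic vector rejection: subtract the part of `specific` that
    lies along `shared`. Assumed here as a stand-in for the idea of
    removing redundancy from modality-specific features.
    """
    coeff = np.dot(specific, shared) / (np.dot(shared, shared) + eps)
    return specific - coeff * shared


# Hypothetical toy features for a single modality.
shared = np.array([1.0, 0.0, 0.0])
specific = np.array([1.0, 1.0, 0.0])

# Full disentanglement would force this angle to 90 degrees; partial
# disentanglement allows an intermediate angle such as this one.
print("angle before refinement:", angle_deg(shared, specific))

refined = remove_shared_component(specific, shared)
print("overlap after refinement:", float(np.dot(refined, shared)))
```

In this toy case the two vectors start at an intermediate angle, which is the regime the abstract calls partial disentanglement, and the projection step leaves a refined specific vector with negligible overlap with the shared one. How the angle is learned adaptively and how shared features are enriched with context are not covered by this sketch.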
