Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation

Xinyi Che, Wenbo Wang, Yuanbo Hou, Mingjie Xie, Qijun Zhao, Jian Guan

TL;DR

This work tackles Multimodal Emotion Recognition in Conversation (MERC) by addressing the overemphasis on cross-modal shared features and the neglect of modality-specific cues. It introduces AO-FL, a framework consisting of Adaptive Angle Optimization (AAO) and Orthogonal Projection Refinement (OPR) to realize partial disentanglement of shared and modality-specific features within each modality, while maintaining cross-modal semantic alignment. The method leverages cross-modal consistency enhancement and within-modality angular exploration, followed by feature refinement and contextual enhancement, achieving state-of-the-art results on IEMOCAP and MELD and demonstrating generalization to broader MER tasks when paired with diverse unimodal extractors. Overall, AO-FL offers a flexible, generalizable approach to fuse multimodal cues by balancing similarity and complementarity, which improves robustness and interpretability of emotion recognition systems.

Abstract

Multimodal Emotion Recognition in Conversation (MERC) aims to enhance emotion understanding by integrating complementary cues from text, audio, and visual modalities. Existing MERC approaches predominantly focus on cross-modal shared features, often overlooking modality-specific features that capture subtle yet critical emotional cues such as micro-expressions, prosodic variations, and sarcasm. Although related work in multimodal emotion recognition (MER) has explored disentangling shared and modality-specific features, these methods typically employ rigid orthogonal constraints to achieve full disentanglement, which neglects the inherent complementarity between feature types and may limit recognition performance. To address these challenges, we propose Angle-Optimized Feature Learning (AO-FL), a framework tailored for MERC that achieves partial disentanglement of shared and specific features within each modality through adaptive angular optimization. Specifically, AO-FL aligns shared features across modalities to ensure semantic consistency, and within each modality it adaptively models the angular relationship between its shared and modality-specific features to preserve both distinctiveness and complementarity. An orthogonal projection refinement further removes redundancy in specific features and enriches shared features with contextual information, yielding more discriminative multimodal representations. Extensive experiments confirm the effectiveness of AO-FL for MERC, demonstrating superior performance over state-of-the-art approaches. Moreover, AO-FL can be seamlessly integrated with various unimodal feature extractors and extended to other multimodal fusion tasks, such as MER, thereby highlighting its strong generalization beyond MERC.
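The geometric idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names `angle_deg` and `project_out` and the toy vectors are hypothetical, and the paper's actual orthogonal projection refinement may differ. The sketch computes the angle between a shared and a specific feature vector, then applies a plain orthogonal projection that removes the part of the specific feature lying along the shared one.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    """Angle between u and v in degrees, in the range [0, 180]."""
    cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return math.degrees(math.acos(cos))

def project_out(h, g):
    """Remove from h its component along g (standard orthogonal projection).

    Assumption: this is the textbook projection, used here only to
    illustrate what 'removing redundancy' could mean geometrically.
    """
    scale = dot(h, g) / dot(g, g)
    return [hi - scale * gi for hi, gi in zip(h, g)]

# Toy 2-D features (hypothetical values, for illustration only).
g = [1.0, 0.0]  # shared feature
h = [1.0, 1.0]  # specific feature, partly overlapping with g

theta = angle_deg(g, h)         # angle before refinement
h_refined = project_out(h, g)   # overlap with g removed
theta_refined = angle_deg(g, h_refined)
```

In this toy case the angle starts at roughly 45 degrees and becomes 90 degrees after the projection. A rigid orthogonal constraint would force every shared/specific pair to 90 degrees during training, whereas the partial disentanglement described above lets the angle be learned per modality.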

Paper Structure

This paper contains 24 sections, 18 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An example from the MELD dataset (Poria et al., 2019) to illustrate specific features. For instance, the semantics of fear can be reflected in specific features, such as furrowed brows and a downturned mouth in the visual modality, or a quick and rising tone in the audio modality.
  • Figure 2: Overview of the proposed Angle-Optimized Feature Learning (AO-FL) framework for MERC. (a) Overall architecture: a baseline unimodal feature extractor (MultiEMO; Shi and Huang, 2023), followed by AO-FL, which comprises Adaptive Angle Optimization (AAO) and Orthogonal Projection Refinement (OPR) for partial disentanglement, and the final classifier. The subscript $m$ ($m \in \{a, t, v\}$) denotes the audio, text, or visual modality. $\bowtie$ denotes concatenation. (b) Partial disentanglement detail: shared and specific features from each modality are adjusted to optimize the angles ($\theta_{a,i}$, $\theta_{t,i}$, $\theta_{v,i}$) within $0^\circ$ to $180^\circ$. Three losses jointly guide angle optimization and feature separation: $\mathcal{L}_{CEN}$ aligns shared features across modalities, $\mathcal{L}_{AAC}$ adaptively optimizes the angle between shared and specific features, and $\mathcal{L}_{CSR}$ ensures the learned angle is large enough to distinguish shared from specific features. $\odot$ denotes cosine similarity calculation.
  • Figure 3: Illustration of cross-modal consistency enhancement, where shared features (i.e., $\mathbf{g}_{s,i}$ and $\mathbf{g}_{m,i}$) from different modalities ($s \neq m$) within the same utterance are pulled together as positive pairs, while those from different utterances ($i \neq j$) with mismatched emotion labels (i.e., $\mathbf{g}_{k,j}$) are pushed apart as negative pairs, with $s, m, k \in \{a, t, v\}$.
  • Figure 4: Illustration of the cosine similarity ranking criterion, where $\mathbf{h}_{m,i}$ denotes the specific feature and $\mathbf{g}_{s,i}$, $\mathbf{g}_{m,i}$ are the shared features. The criterion enforces that the angle between the same utterance's shared features from different modalities ($\phi_{m,i}$) is smaller than the angle between the specific and shared features of a single modality ($\theta_{m,i}$), i.e., $\theta_{m,i} > \phi_{m,i}$.
  • Figure 5: Visualizations of the partially disentangled shared and specific features of the text, audio, and visual modalities on the IEMOCAP dataset, obtained by AO-FL and three of its variants: (a) AO-FL (w/o AAO), (b) AO-FL (w/o CEN), (c) AO-FL (w/o ARE).
  • ...and 2 more figures
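
The pairing rule in the Figure 3 caption and the ranking criterion in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's loss functions: `build_pairs` and `ranking_penalty` are hypothetical names, the hinge form and the `margin` parameter are assumptions, and the feature values are toy data. Because cosine is decreasing on $[0^\circ, 180^\circ]$, the condition $\theta_{m,i} > \phi_{m,i}$ is equivalent to $\cos\theta_{m,i} < \cos\phi_{m,i}$, which is what the penalty checks.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def build_pairs(shared, labels):
    """Enumerate positive and negative pairs of shared features.

    shared[i][m] is the shared feature of utterance i in modality m.
    Positives: same utterance, different modalities.
    Negatives: different utterances whose emotion labels differ.
    """
    keys = [(i, m) for i in range(len(shared)) for m in sorted(shared[i])]
    positives, negatives = [], []
    for (i, m), (j, k) in combinations(keys, 2):
        if i == j:
            positives.append(((i, m), (j, k)))
        elif labels[i] != labels[j]:
            negatives.append(((i, m), (j, k)))
    return positives, negatives

def ranking_penalty(h_m, g_m, g_s, margin=0.0):
    """Hinge-style penalty (assumed form) that is zero when theta > phi.

    theta: angle between specific h_m and shared g_m of one modality.
    phi:   angle between shared features g_s and g_m of two modalities.
    """
    cos_theta = cosine(h_m, g_m)
    cos_phi = cosine(g_s, g_m)
    return max(0.0, cos_theta - cos_phi + margin)

# Toy data: two utterances, audio ("a") and text ("t") shared features.
shared = [
    {"a": [0.9, 0.1], "t": [1.0, 0.0]},
    {"a": [0.1, 0.9], "t": [0.0, 1.0]},
]
labels = ["fear", "joy"]
positives, negatives = build_pairs(shared, labels)

# Specific text feature far from the shared one: criterion satisfied.
ok = ranking_penalty([0.0, 1.0], shared[0]["t"], shared[0]["a"])
# Specific text feature nearly parallel to the shared one: criterion violated.
violated = ranking_penalty([1.0, 0.05], shared[0]["t"], shared[0]["a"])
```

Here `ok` is zero because the specific feature is orthogonal to the shared text feature, while `violated` is positive because the specific feature sits closer to the shared text feature than the audio shared feature does.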