Dual-branch Graph Domain Adaptation for Cross-scenario Multi-modal Emotion Recognition

Yuntao Shou, Jun Zhou, Tao Meng, Wei Ai, Keqin Li

Abstract

Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers' emotional states in multi-turn dialogues through text, audio, and visual cues. In real-world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross-scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual-branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross-scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual-branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out-of-domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross-scenario conversations. Our code is available at https://github.com/Xudmm1239439/DGDA-Net.
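To make the architecture named in the abstract concrete, below is a minimal PyTorch-style sketch (not the authors' released code, which is linked above) of a dual-branch encoder feeding both an emotion classifier and a gradient-reversal domain discriminator. All class names, layer choices, and dimensions are illustrative assumptions; in particular, plain linear layers stand in for the actual HGNN and PathNN branches.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass,
    negated (scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None


class DualBranchEncoder(nn.Module):
    """Stand-in for the dual-branch graph encoder: one branch plays the
    role of the hypergraph branch (HGNN), the other of the path branch
    (PathNN). Plain linear layers replace the actual graph operators."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.hgnn_branch = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.path_branch = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

    def forward(self, x):
        # Fuse the explicit (hypergraph) and implicit (path) views.
        return torch.cat([self.hgnn_branch(x), self.path_branch(x)], dim=-1)


class DGDALikeModel(nn.Module):
    def __init__(self, in_dim=100, hid_dim=64, n_classes=6):
        super().__init__()
        self.encoder = DualBranchEncoder(in_dim, hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, n_classes)  # emotion head
        self.discriminator = nn.Linear(2 * hid_dim, 2)       # source vs. target

    def forward(self, x, grl_lambda=1.0):
        h = self.encoder(x)
        emo_logits = self.classifier(h)
        # The reversed gradient pushes the encoder toward domain-invariant features.
        dom_logits = self.discriminator(GradReverse.apply(h, grl_lambda))
        return emo_logits, dom_logits
```

Calling `model(torch.randn(8, 100))` on a batch of fused utterance features returns emotion logits and domain logits; the latter drive the adversarial alignment described in the abstract.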

Figures (6)

  • Figure 1: (a) Conventional MERC methods: a carefully designed encoder performs multimodal emotion recognition without accounting for out-of-domain distribution differences. (b) Our proposed Dual-branch Graph Domain Adaptation (DGDA) method: DGDA employs a dual-branch encoder to extract multimodal features both explicitly and implicitly, and combines a domain-adversarial alignment strategy with a regularization loss to generalize to out-of-domain data and resist noisy-label interference.
  • Figure 2: An overview of the proposed DGDA framework. The model operates on a labeled source domain and an unlabeled target domain. In both domains, audio, visual, and text features are first extracted and used to construct utterance-level interaction graphs. A dual-branch graph encoder encodes these graphs. For domain alignment, the source domain is adaptively perturbed by a learned noise $\delta$, while a domain discriminator promotes feature invariance across domains. Meanwhile, category-level alignment is enforced by coupling the dual-branch outputs. The final emotion classifier is trained using source labels and pseudo-labeled target samples (a minimal sketch of this training flow follows this list).
  • Figure 3: Verification of the effectiveness of multimodal features.
  • Figure 4: Hyperparameter sensitivity of threshold $\zeta$ and regularization weight $\lambda$.
  • Figure 5: Confusion matrices for multimodal emotion recognition datasets. The matrices provide insights into the model’s classification accuracy, highlighting the challenges and successes in distinguishing between different emotional categories. Top: Results with varying noise levels on the first dataset setting. Bottom: Results with varying noise levels on the second dataset setting.
  • ...and 1 more figure
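Figure 2's caption describes how these components interact during training. Purely as an illustrative sketch of that flow (the learned source perturbation $\delta$ and the noisy-label regularization loss are omitted here, and treating $\zeta$ as the pseudo-label confidence threshold is our assumption, not the paper's stated design), one update step with the model sketched after the abstract might look like:

```python
import torch
import torch.nn.functional as F


def train_step(model, opt, x_src, y_src, x_tgt, zeta=0.9, grl_lambda=1.0):
    """One hypothetical update: supervised emotion loss on the labeled
    source batch, a domain-adversarial loss on both batches, and a
    self-training loss on target samples whose predicted confidence
    exceeds the threshold zeta."""
    opt.zero_grad()

    emo_src, dom_src = model(x_src, grl_lambda)
    emo_tgt, dom_tgt = model(x_tgt, grl_lambda)

    # Supervised emotion classification on the source domain.
    loss = F.cross_entropy(emo_src, y_src)

    # Domain discriminator: source = 0, target = 1. The gradient-reversal
    # layer inside the model makes the encoder oppose this objective.
    dom_logits = torch.cat([dom_src, dom_tgt], dim=0)
    dom_labels = torch.cat([
        torch.zeros(x_src.size(0), dtype=torch.long, device=x_src.device),
        torch.ones(x_tgt.size(0), dtype=torch.long, device=x_tgt.device),
    ])
    loss = loss + F.cross_entropy(dom_logits, dom_labels)

    # Pseudo-label confident target predictions and train on them.
    conf, pseudo = emo_tgt.softmax(dim=-1).max(dim=-1)
    mask = conf > zeta
    if mask.any():
        loss = loss + F.cross_entropy(emo_tgt[mask], pseudo[mask])

    loss.backward()
    opt.step()
    return loss.item()
```

The confidence gate is what keeps self-training from amplifying noise: only target utterances the model is already sure about contribute pseudo-label gradients, which is consistent with the threshold/regularization trade-off studied in Figure 4.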