
ITEACH-Net: Inverted Teacher-studEnt seArCH Network for Emotion Recognition in Conversation

Haiyang Sun, Zheng Lian, Chenglong Wang, Kang Chen, Licai Sun, Bin Liu, Jianhua Tao

TL;DR

This work tackles emotion recognition in conversations (ERC) under incomplete multimodal data, addressing long-range context and modality dropouts. It introduces ECCE to model local and global emotion-context changes and ITS to decouple learning from complete versus incomplete data, with NAS aiding the student in handling missing data. A novel unified evaluation under dynamic missing-rate scenarios demonstrates ITEACH-Net's robustness, outperforming state-of-the-art baselines across three ERC datasets. The approach offers a robust route for incomplete multimodal learning in ERC and provides a framework potentially applicable to broader multimodal tasks.

Abstract

There remain two critical challenges that hinder the development of ERC. Firstly, there is a lack of exploration into mining deeper insights from the data itself for conversational emotion tasks. Secondly, the systems are vulnerable to randomly missing modality features, which is a common occurrence in realistic settings. Focusing on these two key challenges, we propose a novel framework for incomplete multimodal learning in ERC, called "Inverted Teacher-studEnt seArCH Network (ITEACH-Net)." ITEACH-Net comprises two novel components: the Emotion Context Changing Encoder (ECCE) and the Inverted Teacher-Student (ITS) framework. Specifically, leveraging the tendency for emotional states to exhibit local stability within conversational contexts, ECCE captures these patterns and further perceives their evolution over time. Recognizing the varying challenges of handling incomplete versus complete data, ITS employs a teacher-student framework to decouple the respective computations. Subsequently, through Neural Architecture Search, the student model develops enhanced computational capabilities for handling incomplete data compared to the teacher model. During testing, we design a novel evaluation method, testing the model's performance under different missing rate conditions without altering the model weights. We conduct experiments on three benchmark ERC datasets, and the results demonstrate that our ITEACH-Net outperforms existing methods in incomplete multimodal ERC. We believe ITEACH-Net can inspire relevant research on the intrinsic nature of emotions within conversation scenarios and pave a more robust route for incomplete multimodal learning techniques. Code will be made available.
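The evaluation idea in the abstract (one fixed set of weights, scored under several missing rates) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the zero-filling of missing features, and the rule that every utterance keeps at least one modality are all assumptions made for illustration, and `predict_fn` stands in for any already-trained model.

```python
import random


def sample_modality_mask(num_utterances, num_modalities, missing_rate, rng):
    """Return a 0/1 mask per utterance and modality; 0 marks a missing modality."""
    mask = []
    for _ in range(num_utterances):
        row = [0 if rng.random() < missing_rate else 1 for _ in range(num_modalities)]
        # Assumption: an utterance never loses all of its modalities.
        if sum(row) == 0:
            row[rng.randrange(num_modalities)] = 1
        mask.append(row)
    return mask


def apply_mask(features, mask):
    """Zero out the feature vectors of missing modalities.

    features[i][m] is the feature vector (list of floats) of modality m
    for utterance i. Zero-filling is an assumption of this sketch.
    """
    return [
        [[x * keep for x in vec] for vec, keep in zip(utterance, row)]
        for utterance, row in zip(features, mask)
    ]


def evaluate_over_missing_rates(predict_fn, features, labels, rates, seed=0):
    """Score the same fixed predictor under each missing rate (accuracy)."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        mask = sample_modality_mask(len(features), len(features[0]), rate, rng)
        preds = predict_fn(apply_mask(features, mask))
        correct = sum(p == y for p, y in zip(preds, labels))
        results[rate] = correct / len(labels)
    return results


if __name__ == "__main__":
    # Hypothetical toy data: 4 utterances, 3 modalities, 2-dimensional features.
    toy_features = [[[1.0, 1.0] for _ in range(3)] for _ in range(4)]
    toy_labels = [0, 0, 0, 0]
    constant_model = lambda batch: [0 for _ in batch]  # placeholder predictor
    print(evaluate_over_missing_rates(constant_model, toy_features, toy_labels, [0.0, 0.3, 0.7]))
```

The point of the sketch is that `predict_fn` is never retrained or swapped between missing rates; only the mask applied to the test inputs changes.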

Paper Structure (35 sections, 15 equations, 15 figures, 5 tables, 1 algorithm)

Figures (15)

  • Figure 1: The overall structure of Inverted Teacher-studEnt seArCH Network (ITEACH-Net) with the trimodal setting. The Inverted Teacher-Student framework employs a complex student model to learn from a simple teacher model. The Emotion Context Changing Encoder (ECCE) captures the intricate context information within conversations.
  • Figure 2: The computational modules within the Teacher Model and the Student Model.
  • Figure 3: In conversations, the speakers' states tend to maintain a relatively stable pattern within the local context. As the context changes, this pattern evolves.
  • Figure 4: When there are three Token Mixers, the Router's search process involves different utterance features selecting different operations. The parameters for Token Mixer i are shared.
  • Figure 5: Heatmaps generated under different model settings. (a) illustrates the attention map computed by vanilla Self-Attention. (b) illustrates ECCE without encoding local emotion context. (c) illustrates ECCE generated through complete two-stage encoding.
  • ...and 10 more figures