Distribution-based Emotion Recognition in Conversation

Wen Wu, Chao Zhang, Philip C. Woodland

TL;DR

Distribution-based ERC represents the emotion of each utterance as a probability distribution over emotion classes and models a dialogue as a sequence of such distributions. It combines a dialogue-level Transformer with utterance-specific Dirichlet priors via a Dirichlet Prior Network (DPN) and uses audio and text representations from self-supervised learning (SSL) models as multi-modal input. The approach makes it possible to use all utterances, improves uncertainty estimation as measured by AUPR, and achieves higher accuracy than single-utterance baselines on IEMOCAP. This work advances emotion-aware conversational AI by explicitly modelling uncertainty and cross-utterance dynamics.

Abstract

Automatic emotion recognition in conversation (ERC) is crucial for emotion-aware conversational artificial intelligence. This paper proposes a distribution-based framework that formulates ERC as a sequence-to-sequence problem for emotion distribution estimation. The inherent ambiguity of emotions and the subjectivity of human perception lead to disagreements in emotion labels, which are handled naturally in our framework from the perspective of uncertainty estimation in emotion distributions. A Bayesian training loss is introduced to improve the uncertainty estimation by conditioning each emotional state on an utterance-specific Dirichlet prior distribution. Experimental results on the IEMOCAP dataset show that ERC outperforms the single-utterance-based system, and that the proposed distribution-based ERC methods achieve not only better classification accuracy but also improved uncertainty estimation.
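
To make the idea of an emotion distribution concrete, the following minimal sketch turns the labels given by several annotators into a soft label. It assumes the soft label is simply the relative frequency of each class among the annotators' votes, which the text above does not confirm. The class names other than "others" and the function `soft_label` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical class inventory: only the merged class "others" is named in
# this document; the remaining names are placeholders for illustration.
CLASSES = ["angry", "happy", "neutral", "sad", "others"]


def soft_label(votes, classes=CLASSES):
    """Return the relative frequency of each class among annotator votes.

    Assumption: a soft label is the normalised vote count per class.
    """
    if not votes:
        raise ValueError("at least one annotator vote is required")
    unknown = set(votes) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


# Two annotators agree and one disagrees: the disagreement is kept as
# probability mass instead of being discarded by a majority vote.
print(soft_label(["happy", "happy", "neutral"]))

# All three annotators disagree: there is no majority label, but the
# utterance can still be described by a distribution.
print(soft_label(["angry", "sad", "others"]))
```

Under this assumption, an utterance on which annotators disagree completely still yields a usable target, which is one way to read the claim that disagreement is handled naturally in a distribution-based framework.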

Paper Structure

This paper contains 25 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Schematic of the proposed distribution-based ERC system with a Transformer, where $\mathbf{x}_n$ is an utterance and $\mathbf{y}_n$ is the corresponding emotion distribution. Bilinear pooling is used to fuse the audio and text features derived from W2V2 and BERT.
  • Figure 2: Schematic of bilinear pooling with shortcut combination. $\odot$ represents the Hadamard product and $\oplus$ represents the element-wise addition of vectors.
  • Figure 3: Distribution of data in IEMOCAP. (a) Proportion of annotators agreeing on the label. (b) Ground-truth of utterances with unique majority labels.
  • Figure 4: Entropy of the predicted emotion distribution of each utterance in a sub-dialogue. The DPN-KL system trained on Sessions 1-4 was used. For each sentence, the bar chart shows the soft label and the line on the bar chart shows the prediction. Labels provided by the three annotators are shown in the grey box, with "a1" referring to the first annotator, and so on ("frustrated" and "confused" are merged into the fifth class, "others").
  • Figure 5: PR curves of the three systems using (a) Max.P and (b) Ent. as the uncertainty measures. The tests were performed on Session 5.
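
The uncertainty measures named in the captions of Figures 4 and 5 can be sketched as follows. Reading "Ent." as the entropy of the predicted distribution follows the Figure 4 caption; reading "Max.P" as the maximum predicted class probability is an assumption. The function names and example distributions are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted emotion distribution.

    Higher entropy means the prediction is spread over more classes,
    so the system is less certain.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_prob(probs):
    """Largest class probability (assumed meaning of "Max.P").

    A lower maximum probability means a less certain prediction.
    """
    return max(probs)


# Hypothetical predictions over five emotion classes.
confident = [0.90, 0.04, 0.03, 0.02, 0.01]
ambiguous = [0.20, 0.20, 0.20, 0.20, 0.20]

for name, probs in [("confident", confident), ("ambiguous", ambiguous)]:
    print(name, round(entropy(probs), 3), max_prob(probs))
```

For a uniform distribution over five classes the entropy reaches its maximum of ln 5, while a one-hot prediction has zero entropy, so either score can be thresholded to rank utterances by uncertainty when drawing a PR curve.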