
Iterative Prototype Refinement for Ambiguous Speech Emotion Recognition

Haoqin Sun, Shiwan Zhao, Xiangyu Kong, Xuechen Wang, Hui Wang, Jiaming Zhou, Yong Qin

TL;DR

This work addresses the challenge of ambiguity in speech emotion recognition by introducing Iterative Prototype Refinement (IPR), a framework that combines class prototypes, prototype updates, and contrastive learning to better represent ambiguous emotions. IPR bootstraps from a small set of precisely labeled data to learn initial prototypes, then iteratively refines them on ambiguous unlabeled data with moving-average updates guided by pseudo labels. A contrastive objective further sharpens the representations, and a composite training loss integrates supervised and self-supervised signals. On the IEMOCAP benchmark, IPR achieves 70.75% accuracy, a 2.00% absolute improvement over the previous state of the art, validating its ability to harness ambiguous data for robust emotion recognition and scalable annotation.
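The bootstrap-then-refine loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each prototype is assumed to be the re-normalized mean of the unit-length embeddings of precisely labeled samples, each unlabeled embedding is assumed to receive the pseudo label of its most cosine-similar prototype, and that prototype is moved toward the embedding by an exponential moving average with a hypothetical momentum `gamma`. The helper names (`init_prototypes`, `pseudo_label`, `update_prototypes`) are illustrative only.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def init_prototypes(embeddings: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> Dict[int, Vector]:
    """Bootstrap one prototype per class from precisely labeled embeddings.

    Assumption: a prototype is the re-normalized mean of the class's unit embeddings.
    """
    sums: Dict[int, Vector] = {}
    for z, y in zip(embeddings, labels):
        z = normalize(z)
        acc = sums.setdefault(y, [0.0] * len(z))
        sums[y] = [s + x for s, x in zip(acc, z)]
    return {c: normalize(s) for c, s in sums.items()}


def pseudo_label(z: Sequence[float], prototypes: Dict[int, Vector]) -> int:
    """Assign the class whose (unit) prototype is most cosine-similar to z."""
    z = normalize(z)
    return max(prototypes, key=lambda c: dot(z, prototypes[c]))


def update_prototypes(prototypes: Dict[int, Vector],
                      unlabeled: Sequence[Sequence[float]],
                      gamma: float = 0.99) -> Dict[int, Vector]:
    """Moving-average refinement guided by pseudo labels (gamma is a hypothetical momentum).

    For the pseudo-labeled class c: p_c <- normalize(gamma * p_c + (1 - gamma) * z).
    The dictionary is updated in place and also returned.
    """
    for z in unlabeled:
        z = normalize(z)
        c = pseudo_label(z, prototypes)
        prototypes[c] = normalize(
            [gamma * p + (1 - gamma) * x for p, x in zip(prototypes[c], z)])
    return prototypes


# Usage on toy 2-D embeddings: two labeled classes, then one unlabeled sample.
protos = init_prototypes([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [0, 0, 1])
protos = update_prototypes(protos, [[0.8, 0.6]], gamma=0.9)
print(pseudo_label([1.0, 0.2], protos), pseudo_label([0.1, 1.0], protos))
```

Re-normalizing after every update keeps each prototype on the unit sphere, so cosine similarity reduces to a dot product. A larger `gamma` makes prototypes drift more slowly, which limits the damage an occasional wrong pseudo label can do.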

Abstract

Recognizing emotions from speech is a daunting task due to the subtlety and ambiguity of emotional expressions. Traditional speech emotion recognition (SER) systems, which typically rely on a single, precise emotion label, struggle with this complexity. Modeling the inherent ambiguity of emotions is therefore an urgent problem. In this paper, we propose an iterative prototype refinement framework (IPR) for ambiguous SER. IPR comprises two interlinked components: contrastive learning and class prototypes. The former provides an efficient way to obtain high-quality representations of ambiguous samples. The latter are dynamically updated based on ambiguous labels, i.e., the similarity of each ambiguous sample to all prototypes. These refined embeddings yield precise pseudo labels, which in turn reinforce representation quality. Experimental evaluations on the IEMOCAP dataset show that IPR outperforms state-of-the-art methods, demonstrating the effectiveness of the proposed method.
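The abstract defines an ambiguous label as the similarity of an ambiguous sample to all prototypes. One hedged way to make that concrete, assumed here rather than taken from the paper, is a temperature-scaled softmax over cosine similarities, whose most probable class doubles as a hard pseudo label. The contrastive component is illustrated with a standard InfoNCE loss for a single anchor; how IPR selects positives and negatives is not stated in this section, so the temperature `tau`, the positive/negative choice, and the function names `ambiguous_label` and `info_nce` are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def ambiguous_label(z: Sequence[float], prototypes: Dict[int, Sequence[float]],
                    tau: float = 0.1) -> Dict[int, float]:
    """Soft label over all classes: softmax of cosine(z, p_c) / tau (assumed form)."""
    logits = {c: cosine(z, p) / tau for c, p in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], tau: float = 0.1) -> float:
    """InfoNCE loss -log(exp(s_pos) / sum_k exp(s_k)), where s = cosine / tau."""
    logits = [cosine(anchor, positive) / tau] + [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[0]


# Usage: a sample lying halfway between two prototypes gets an even soft label.
protos = {0: [1.0, 0.0], 1: [0.0, 1.0]}
soft = ambiguous_label([0.7, 0.7], protos)
hard = max(soft, key=soft.get)  # hard pseudo label = most probable class
print(soft, hard)
```

A lower `tau` sharpens the soft label toward a one-hot pseudo label, while a higher value preserves more of the ambiguity between classes.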

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: The overall architecture of IPR, consisting of a class prototype learning phase, a class prototype updating phase, and a contrastive learning phase.
  • Figure 2: Similarity between class prototype embeddings during training.
  • Figure 3: Agreement rate between pseudo labels and ground-truth labels, for model-generated pseudo labels (blue) and prototype-assigned pseudo labels (red).
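Figures 2 and 3 track two training diagnostics. A minimal sketch of how each quantity could be computed at a given checkpoint follows, assuming prototypes are stored as a class-to-vector mapping and pseudo labels and ground truth as equal-length integer lists. The function names are hypothetical, and the figures' exact protocol (evaluation frequency, data split) is not described here.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def prototype_similarity(
        prototypes: Dict[int, Sequence[float]]) -> Dict[Tuple[int, int], float]:
    """Cosine similarity for every unordered pair of class prototypes (Figure 2-style)."""
    return {(i, j): cosine(prototypes[i], prototypes[j])
            for i, j in combinations(sorted(prototypes), 2)}


def agreement_rate(pseudo: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of samples whose pseudo label matches the ground truth (Figure 3-style)."""
    if len(pseudo) != len(truth) or not truth:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == t for p, t in zip(pseudo, truth)) / len(truth)


# Usage on toy values.
sims = prototype_similarity({0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]})
rate = agreement_rate([0, 1, 1, 2], [0, 1, 2, 2])
print(sims, rate)
```

Under these assumptions, lower pairwise similarity indicates better-separated prototypes, and the two curves in Figure 3 would correspond to `agreement_rate` applied to model-generated and to prototype-assigned pseudo labels.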