Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie

TL;DR

This work tackles hallucinations in speech emotion recognition (SER) driven by audio language models (ALMs) by introducing C$^2$SER, which combines contextual perception (semantic via Whisper, acoustic via Emotion2Vec-S trained with a category-level contrastive loss) with two-stage chain-of-thought reasoning (Explicit CoT followed by Implicit CoT with self-distillation). It demonstrates that grounding SER in detailed speech content and speaking style, together with structured, self-distilled reasoning, yields more stable and accurate emotion predictions across diverse languages and datasets. The authors also introduce Emo-Emilia, a robust evaluation test set, and release code, checkpoints, and data to foster further research. Overall, C$^2$SER outperforms popular ALMs on multiple benchmarks, reducing hallucination-induced errors while maintaining strong cross-domain generalization.

Abstract

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
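
To make the idea of self-distillation from explicit CoT to implicit CoT more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the explicit-CoT pass acts as a teacher and the implicit-CoT pass as a student, and that the two are aligned with a temperature-softened KL divergence over emotion logits. The function names, the temperature value, and the use of KL divergence are illustrative assumptions; the abstract does not specify the actual objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Hypothetical distillation term (an assumption, not the paper's loss).

    teacher_logits: emotion logits from the explicit-CoT pass.
    student_logits: emotion logits from the implicit-CoT pass.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(teacher, student)


# Illustrative logits over three emotion classes (made-up values).
explicit_cot = [2.0, 0.5, -1.0]
implicit_cot = [1.2, 0.9, -0.4]
loss = self_distillation_loss(explicit_cot, implicit_cot)
```

In this sketch the loss is zero when the implicit pass reproduces the explicit pass exactly and grows as the two distributions diverge, which is the behavior one would want from a term that transfers step-by-step reasoning into a direct prediction.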

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of C$^2$SER. The top path shows how standard models can "hallucinate" by generating irrelevant context. The bottom path demonstrates our two-step Chain-of-Thought (CoT) approach: first generating a detailed rationale (Explicit CoT), and then internalizing this capability for a direct and stable prediction (Implicit CoT).
  • Figure 2: The detailed architecture and two-stage training process of C$^2$SER. Stage 1 (Explicit CoT): The model is trained to generate a step-by-step rationale by combining semantic features from Whisper and acoustic features from Emotion2Vec-S. Stage 2 (Implicit CoT): Through self-distillation, the model is trained to produce a direct emotional description, enhancing efficiency while preserving the reasoning capabilities.
  • Figure 3: The three types of losses used in Emotion2Vec-S. From top to bottom: utterance-level loss, frame-level loss, and category-level loss.
  • Figure 4: Training data emotion distribution: each slice represents a different emotion, with percentages shown.
  • Figure 5: Training data language distribution: each slice represents a different language, with percentages shown.
  • ...and 2 more figures
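
The category-level loss named in the Figure 3 caption is described in the TL;DR as a contrastive loss. As a rough illustration only, the sketch below uses a supervised-contrastive-style formulation in which utterance embeddings with the same emotion label are pulled together and all others are pushed apart. This formulation, the cosine similarity, the temperature, and all names are assumptions for illustration; the exact loss used in Emotion2Vec-S is not given in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def category_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss over a batch (illustrative assumption).

    For each anchor, samples sharing its label are positives and every
    other sample in the batch contributes to the softmax denominator.
    Anchors without any positive in the batch are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        sims = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D "utterance embeddings" with made-up values.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = category_contrastive_loss(batch, ["happy", "happy", "sad", "sad"])
scattered = category_contrastive_loss(batch, ["happy", "sad", "happy", "sad"])
```

With these toy inputs, the labeling that matches the geometric clusters gives a much smaller loss than the labeling that cuts across them, which is the discriminative pressure a category-level term is meant to apply.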