Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao; Xinfa Zhu; Xinsheng Wang; Shuiyuan Wang; Xuelong Geng; Wenjie Tian; Lei Xie

TL;DR

This work tackles hallucinations in speech emotion recognition (SER) driven by audio language models (ALMs) by introducing C$^2$SER, which combines contextual perception (semantic via Whisper, acoustic via Emotion2Vec-S with a category-level contrastive loss) with two-stage chain-of-thought reasoning (explicit CoT followed by implicit CoT with self-distillation). It demonstrates that grounding SER in detailed speech content and speaking style, together with structured, self-distilled reasoning, yields more stable and accurate emotion predictions across diverse languages and datasets. The authors also introduce Emo-Emilia, an evaluation test set, and release code, checkpoints, and data to foster further research. Overall, C$^2$SER outperforms popular ALMs on multiple benchmarks, reducing hallucination-induced errors while maintaining strong cross-domain generalization.

Abstract

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER step by step while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
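The abstract does not spell out the self-distillation objective, so the following is only a minimal sketch of how distillation from an explicit-CoT teacher to an implicit-CoT student is commonly set up, not the authors' implementation. It assumes both models produce logits over a fixed emotion label set, and it mixes a cross-entropy term on the ground-truth label with a temperature-scaled KL term toward the teacher. The label list, the logits, and the `alpha` and `temperature` values are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def self_distillation_loss(student_logits, teacher_logits, target_index,
                           alpha=0.5, temperature=2.0):
    """Assumed objective: alpha * CE(student, label)
    + (1 - alpha) * KL(teacher || student) at a shared temperature."""
    hard = cross_entropy(softmax(student_logits), target_index)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1.0 - alpha) * soft

# Hypothetical label set and logits, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]
teacher_logits = [0.2, 3.1, 0.5, -1.0]   # explicit-CoT pass (teacher)
student_logits = [0.4, 1.2, 1.0, -0.3]   # implicit-CoT pass (student)

loss = self_distillation_loss(student_logits, teacher_logits,
                              EMOTIONS.index("happy"))
print(f"distillation loss: {loss:.4f}")
```

In this sketch the KL term vanishes when the student matches the teacher's distribution, so the remaining loss is the weighted cross-entropy alone. The intuition is that the student learns to reach the teacher's conclusion without generating the intermediate reasoning text where errors can accumulate.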
