Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition

Bo-Hao Su, Hui-Ying Shih, Jinchuan Tian, Jiatong Shi, Chi-Chun Lee, Carlos Busso, Shinji Watanabe

TL;DR

The paper addresses the interpretability gap in speech emotion recognition (SER) by moving beyond majority-vote labels to an explainable SpeechLM that emits transcripts, emotion labels, and natural-language rationales grounded in lexical and acoustic cues. A reasoning-capable teacher LLM generates training rationales used as intermediate supervision for a LoRA-tuned SpeechLM, preserving standard label supervision while enhancing explainability. The approach achieves competitive accuracy on MSP-Podcast v1.12 and shows improved Macro-F1 under both majority-label and annotator-aware evaluations, with human raters and LLM judges preferring the rationale explanations. This work demonstrates a practical path to transparency in SER without sacrificing performance, with potential benefits for annotator-informed evaluation and downstream trust.

Abstract

Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.
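
The annotator-aware scoring mentioned above can be illustrated with a small sketch. The abstract states only that a prediction is credited when it matches any annotator's label; the function names, the toy labels, and the choice of plain accuracy as the aggregate are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Set


def majority_accuracy(preds: Sequence[str], majority: Sequence[str]) -> float:
    """Fraction of predictions equal to the majority-voted label."""
    if len(preds) != len(majority):
        raise ValueError("preds and majority must have the same length")
    hits = sum(p == m for p, m in zip(preds, majority))
    return hits / len(preds)


def annotator_aware_accuracy(
    preds: Sequence[str], annotator_labels: Sequence[Set[str]]
) -> float:
    """Fraction of predictions matching at least one annotator's label."""
    if len(preds) != len(annotator_labels):
        raise ValueError("preds and annotator_labels must have the same length")
    hits = sum(p in labels for p, labels in zip(preds, annotator_labels))
    return hits / len(preds)


# Hypothetical toy data: four utterances, each with a set of annotator labels.
preds = ["angry", "sad", "happy", "neutral"]
majority = ["angry", "neutral", "sad", "neutral"]
annotators = [
    {"angry", "neutral"},
    {"neutral", "sad"},
    {"sad", "neutral"},
    {"neutral"},
]

maj_acc = majority_accuracy(preds, majority)
aware_acc = annotator_aware_accuracy(preds, annotators)
print(f"majority-label accuracy:  {maj_acc:.2f}")
print(f"annotator-aware accuracy: {aware_acc:.2f}")
```

In this toy case the second utterance counts as an error under the majority label but is credited under annotator-aware scoring, because one annotator chose "sad". The same any-match rule could in principle feed a Macro-F1 computation, but how the paper does that is not specified in the abstract.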

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Subjective evaluation results (10 raters).
  • Figure 2: Win-rate comparison by gemini-2.0-flash.