Towards LLM-Empowered Fine-Grained Speech Descriptors for Explainable Emotion Recognition

Youjun Chen, Xurong Xie, Haoning Xu, Mengzhe Geng, Guinan Li, Chengxi Deng, Huimeng Wang, Shujie Hu, Xunying Liu

TL;DR

The paper tackles explainable SER by predicting both fine-grained speech emotion descriptors and standard SER labels directly from speech. It introduces an end-to-end LLM-empowered architecture that disentangles content and descriptor information from HuBERT SSL features using an IB (VIB) objective and alternating multi-task fine-tuning, plus a VAE-style compression. A Llama-3.1-8B-Instruct decoder with adapters translates latent cues into transcript, descriptors, and labels, trained on a large GigaSpeech subset and evaluated on IEMOCAP, MELD, and EMOSEC. Results show statistically significant improvements in SER unweighted accuracy (up to 4.0% absolute on IEMOCAP and 3.7% on MELD) over strong LLaMA baselines, with the emotion descriptors providing enhanced explainability.

Abstract

This paper presents a novel end-to-end LLM-empowered explainable speech emotion recognition (SER) approach. Fine-grained speech emotion descriptor (SED) features, e.g., pitch, tone and emphasis, are disentangled from HuBERT SSL representations via alternating LLM fine-tuning on joint SER-SED prediction and ASR tasks. VAE-compressed HuBERT features derived via Information Bottleneck (IB) are used to adjust feature granularity. Experiments on the IEMOCAP and MELD benchmarks demonstrate that our approach consistently outperforms comparable LLaMA-based SER baselines, including those using either (a) alternating multi-task fine-tuning alone or (b) feature disentanglement only. Statistically significant increases in SER unweighted accuracy of up to 4.0% and 3.7% absolute (5.4% and 6.6% relative) are obtained. More importantly, emotion descriptors offer further explainability for SER.
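As a reading aid for the accuracy figures above, the sketch below shows how unweighted accuracy is commonly computed in SER (the mean of per-class recalls, so every emotion class counts equally) and how an absolute gain relates to its relative counterpart. The function names and the toy labels are hypothetical and do not come from the paper. The baselines it derives are only what the rounded gains imply, not reported results.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and the matching relative gain.

    relative_gain = absolute_gain / baseline, hence baseline = absolute / relative.
    """
    return absolute_gain / relative_gain


# Hypothetical toy labels (not data from the paper): the majority class is
# always right, one minority class is always wrong.
y_true = ["neu", "neu", "neu", "neu", "hap", "hap", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "hap", "neu", "ang", "neu"]

ua = unweighted_accuracy(y_true, y_pred)
wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"unweighted accuracy: {ua:.3f}, plain accuracy: {wa:.3f}")

# Baselines implied by the rounded gains quoted in the abstract (approximate).
for absolute, relative in [(4.0, 0.054), (3.7, 0.066)]:
    base = implied_baseline(absolute, relative)
    print(f"+{absolute}% absolute at {relative:.1%} relative -> baseline near {base:.1f}%")
```

On the toy labels the unweighted score is lower than plain accuracy because the missed minority class weighs as much as the majority class, which is why this metric is preferred on class-imbalanced emotion corpora.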

Paper Structure

This paper contains 18 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of the proposed LLM-empowered explainable speech emotion recognition approach and prompt template.
  • Figure 2: Illustration of two stages of the whole training process on the two sequential downstream tasks with the alternating multi-task fine-tuning strategy.