Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances

Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, Julia Hirschberg

TL;DR

SpeechCueLLM enables multimodal emotion recognition by translating speech prosody into natural language descriptions embedded in LLM prompts, avoiding architectural changes. It achieves notable improvements on IEMOCAP (over a 2-point gain in average weighted F1) and competitive results on MELD, while matching SOTA performance with significantly fewer trainable parameters. The work demonstrates the effectiveness of prompt-based, description-driven fusion of audio and text, highlights the role of audio quality, and shows broad applicability across open-source LLMs with LoRA fine-tuning. Practical impact includes a simple, efficient pathway to improve emotion recognition in real-world, resource-constrained settings, with clear directions for handling noisy audio and extending to additional modalities.

Abstract

Emotion recognition in speech is a challenging multimodal task that requires understanding both verbal content and vocal nuances. This paper introduces a novel approach to emotion detection using Large Language Models (LLMs), which have demonstrated exceptional capabilities in natural language understanding. To overcome the inherent limitation of LLMs in processing audio inputs, we propose SpeechCueLLM, a method that translates speech characteristics into natural language descriptions, allowing LLMs to perform multimodal emotion analysis via text prompts without any architectural changes. Our method is minimal yet impactful, outperforming baseline models that require structural modifications. We evaluate SpeechCueLLM on two datasets: IEMOCAP and MELD, showing significant improvements in emotion recognition accuracy, particularly for high-quality audio data. We also explore the effectiveness of various feature representations and fine-tuning strategies for different LLMs. Our experiments demonstrate that incorporating speech descriptions yields an increase of more than 2 percentage points in the average weighted F1 score on IEMOCAP (from 70.111% to 72.596%).
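
As a rough illustration of what "translating speech characteristics into natural language descriptions" could look like, the sketch below maps three numeric features (pitch, volume, and speaking rate, the features named in Figure 1) to coarse labels and renders them as a sentence. The function names, the use of z-scores as inputs, the 0.5 threshold, and the wording of the sentence are all assumptions made for illustration; they are not the paper's actual feature extraction or description templates.

```python
def level(z, threshold=0.5):
    """Map a z-score to a coarse label (threshold is an illustrative assumption)."""
    if z > threshold:
        return "high"
    if z < -threshold:
        return "low"
    return "medium"


def describe_speech(pitch_z, volume_z, rate_z):
    """Render hypothetical per-utterance z-scores as a plain-text description."""
    return (
        f"The speaker uses a {level(pitch_z)} pitch, "
        f"a {level(volume_z)} volume, "
        f"and a {level(rate_z)} speaking rate."
    )


# Made-up feature values, for demonstration only.
print(describe_speech(1.2, 0.1, -0.8))
```

Because the output is ordinary text, it can be placed in a prompt alongside the transcript, which is what lets the approach avoid any change to the model architecture.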

Paper Structure (29 sections, 4 figures, 13 tables)

Figures (4)

  • Figure 1: Example of speech characteristic description (top) and derived impression (bottom). Each color represents the same set of features (pitch, volume, and speaking rate).
  • Figure 2: SpeechCueLLM Prompt Template for Emotion Detection: the last bold sentence with an underline represents the target utterance. The orange section highlights the outputs with added speech descriptions. This structured template integrates both textual context and speech characteristics to guide the LLM in performing multimodal emotion analysis.
  • Figure 3: LLM Prompt Template for speech-feature-only Emotion Detection.
  • Figure 4: The baseline model structure involves projecting speech encoder features directly into the LLM embedding space.
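
The caption of Figure 2 describes a prompt that combines dialogue context, a target utterance, and an added speech description. A minimal sketch of assembling such a prompt is shown below. The field labels, the ordering, the instruction wording, the emotion label set, and the example dialogue are all hypothetical; the actual template is the one shown in the paper's figure and may differ in every one of these details.

```python
def build_prompt(context, target, speech_description, labels):
    """Assemble a text-only prompt from dialogue context, target utterance, and speech cues.

    context: list of (speaker, text) pairs preceding the target utterance.
    target: (speaker, text) pair for the utterance to classify.
    speech_description: plain-text description of how the target was spoken.
    labels: candidate emotion labels (illustrative, not the paper's label set).
    """
    history = "\n".join(f"{speaker}: {text}" for speaker, text in context)
    speaker, text = target
    return (
        "Dialogue context:\n"
        f"{history}\n\n"
        f"Target utterance: {speaker}: {text}\n"
        f"Speech description: {speech_description}\n"
        f"Choose one emotion from: {', '.join(labels)}.\n"
        "Emotion:"
    )


# Made-up dialogue and labels, for demonstration only.
prompt = build_prompt(
    context=[("A", "Did you get the results?"), ("B", "Yes, this morning.")],
    target=("A", "And?"),
    speech_description="The speaker uses a high pitch and a fast speaking rate.",
    labels=["happy", "sad", "neutral", "angry"],
)
print(prompt)
```

Dropping the context and transcript fields and keeping only the speech description would give a speech-feature-only variant in the spirit of Figure 3, again under the same assumptions.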