VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

Yancheng Wang; Osama Hanna; Ruiming Xie; Xianfeng Rui; Maohao Shen; Xuedong Zhang; Christian Fuegen; Jilong Wu; Debjyoti Paul; Arthur Guo; Zhihong Lei; Ozlem Kalinli; Qing He; Yingzhen Yang

TL;DR

VowelPrompt improves speech emotion recognition (SER) by injecting fine-grained vowel-level prosodic cues into large language model reasoning. It extracts six features from each time-aligned vowel segment ($F_0$ level, $F_0$ slope, $F_0$ variation, RMS energy level, RMS energy variation, and vowel duration), normalizes and discretizes them into textual descriptors, and appends them to the transcript so the model can reason jointly over lexical and prosodic evidence. A two-stage training pipeline, supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR) via Group Relative Policy Optimization (GRPO), improves reasoning quality and output structure while maintaining generalization. Across five SER benchmarks and multilingual settings, VowelPrompt consistently outperforms transcript-only and sentence-level baselines in zero-shot, fine-tuned, cross-domain, and multilingual scenarios, while offering interpretable, vowel-grounded explanations of its decisions.
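
The extract, normalize, discretize, and verbalize steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 10 ms frame step, the per-utterance z-score normalization, the three-way low/medium/high binning, and the ±0.5 thresholds are all assumptions made for illustration. It also assumes that per-frame $F_0$ and RMS values for each vowel segment have already been obtained from a pitch tracker and a forced aligner.

```python
import numpy as np

def vowel_features(f0_hz, rms, frame_s=0.01):
    """Summarize one vowel segment from per-frame F0 (Hz) and RMS energy.

    frame_s is an assumed frame step in seconds.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    rms = np.asarray(rms, dtype=float)
    t = np.arange(len(f0)) * frame_s
    # Slope of a least-squares line through the F0 contour (Hz per second).
    slope = float(np.polyfit(t, f0, 1)[0]) if len(f0) > 1 else 0.0
    return {
        "f0_level": float(f0.mean()),
        "f0_slope": slope,
        "f0_variation": float(f0.std()),
        "energy_level": float(rms.mean()),
        "energy_variation": float(rms.std()),
        "duration": len(f0) * frame_s,
    }

def discretize(z, low=-0.5, high=0.5):
    """Map a z-score to a coarse label (thresholds are assumptions)."""
    if z < low:
        return "low"
    if z > high:
        return "high"
    return "medium"

def describe(vowels):
    """Turn a list of (vowel_label, feature_dict) pairs into text lines.

    Features are z-normalized across the vowels of one utterance, which is
    an assumed normalization scheme.
    """
    names = list(vowels[0][1].keys())
    mat = np.array([[feats[n] for n in names] for _, feats in vowels])
    mu = mat.mean(axis=0)
    sd = mat.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant features
    z = (mat - mu) / sd
    lines = []
    for (label, _), row in zip(vowels, z):
        parts = [f"{n.replace('_', ' ')} {discretize(v)}" for n, v in zip(names, row)]
        lines.append(f"{label}: " + ", ".join(parts))
    return lines

# Two toy vowel segments: a short rising one and a longer, louder, flat one.
rising = vowel_features([100, 110, 120], [0.1, 0.1, 0.1])
flat = vowel_features([200] * 6, [0.3, 0.5, 0.3, 0.5, 0.3, 0.5])
descriptor_lines = describe([("AA", rising), ("IY", flat)])
```

The resulting lines could then be appended to the transcript in the prompt. Coarse labels rather than raw numbers keep the descriptors readable by a text-only model, at the cost of discarding fine-grained magnitude information.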

Abstract

Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
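
To make the RLVR stage more concrete, the following minimal sketch shows a rule-based (verifiable) reward and the group-relative advantage normalization that GRPO uses in place of a learned value function. It is an illustration under stated assumptions, not the paper's reward design: the `<answer>...</answer>` output format, the 0.5 format bonus, the 1.0 correctness bonus, and the function names are all hypothetical.

```python
import re
import statistics

def verifiable_reward(completion, gold_label):
    """Score one sampled completion with simple, checkable rules.

    Assumed scheme: 0.5 for producing a well-formed <answer> tag, plus 1.0
    if the tagged label matches the gold emotion label.
    """
    match = re.search(r"<answer>\s*(\w+)\s*</answer>", completion)
    if match is None:
        return 0.0  # malformed output earns no reward
    correct = match.group(1).lower() == gold_label.lower()
    return 0.5 + (1.0 if correct else 0.0)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Three hypothetical completions sampled for one utterance labeled "angry".
samples = [
    "The rising pitch suggests agitation. <answer>angry</answer>",
    "The wording sounds positive. <answer>happy</answer>",
    "I think the speaker is angry.",  # missing the required tag
]
rewards = [verifiable_reward(s, "angry") for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, completions are rewarded only relative to other samples for the same prompt, which is why a structured, correct answer is pushed up and a malformed one is pushed down even when absolute reward scales differ across prompts.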

Paper Structure (30 sections, 1 equation, 2 figures, 23 tables)

Introduction
Related Works
Methods
Vowel-Level Acoustic Feature Extraction
Fine-tuning LLM for Emotion Recognition with Vowel-level Acoustic Features
Multilingual Extension with IPA-based Vowel Mapping
Experiments
Datasets
Zero-Shot Emotion Recognition
LLM Fine-Tuning for Emotion Recognition
Cross-Domain Emotion Recognition
Extracting Vowel-Level Acoustic Features from Multilingual Speech
Conclusion
Additional Experiment Results
Ablation Study on Individual Acoustic Features
...and 15 more sections

Figures (2)

Figure 1: An example of the proposed VowelPrompt framework for the emotion recognition task.
Figure 2: Example of a prompt of VowelPrompt combining conversational context, target utterance, and vowel-level prosodic descriptors. The transcript provides lexical content, while each vowel in the target utterance is annotated with interpretable acoustic features, including pitch slope, pitch level and variation, intensity level and variation, and duration. These features are expressed in natural language and integrated into the input to guide the emotion inference by LLM. The example illustrates a frustration-labeled case from IEMOCAP, where prosodic patterns such as high pitch slope and extended vowel duration convey heightened emotional intensity.
