EmoSURA: Towards Accurate Evaluation of Detailed and Long-Context Emotional Speech Captions

Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang, Shahin Amiriparian, Jun Luo, Björn Schuller

TL;DR

EmoSURA is a novel evaluation framework that decomposes complex captions into Atomic Perceptual Units (self-contained statements about vocal or emotional attributes) and uses an audio-grounded verification mechanism to validate each unit against the raw speech signal.

Abstract

Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges often suffer from reasoning inconsistency and context-collapse when processing long-form descriptions. In this work, we propose EmoSURA, a novel evaluation framework that shifts the paradigm from holistic scoring to atomic verification. EmoSURA decomposes complex captions into Atomic Perceptual Units, which are self-contained statements regarding vocal or emotional attributes, and employs an audio-grounded verification mechanism to validate each unit against the raw speech signal. Furthermore, we address the scarcity of standardized evaluation resources by introducing SURABench, a carefully balanced and stratified benchmark. Our experiments show that EmoSURA achieves a positive correlation with human judgments, offering a more reliable assessment for long-form captions compared to traditional metrics, which demonstrated negative correlations due to their sensitivity to caption length.

Paper Structure (11 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The framework of EmoSURA. It consists of three steps: (1) decomposition of captions into Atomic Perceptual Units (APUs) using LLMs; (2) verification of the generated APUs against the raw audio using an ALM; and (3) matching of the generated APUs with the benchmark to assess comprehensiveness.
  • Figure 2: The emotional distribution of SURABench in the Valence-Arousal space. Point colors represent Dominance. The marginal histograms (top and right) demonstrate the uniformity of the dataset.
  • Figure 3: Distribution plots (top) and scatter plots against human ratings (bottom) for baseline metrics versus EmoSURA. Unlike BLEU-4, ROUGE-L, and CIDEr, which show negative correlations (red lines), EmoSURA shows a positive linear relationship (green line) with human ratings, indicating higher reliability.
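
The three steps named in the Figure 1 caption (decompose, verify, match) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `EvalResult` fields, and the two ratio scores (`correctness`, `coverage`) are hypothetical, and the paper's own equations may define the scores differently. The LLM decomposer, the ALM verifier, and the matcher are passed in as plain callables, and the toy stand-ins below replace them with string operations so that the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class EvalResult:
    n_units: int        # APUs extracted from the caption
    n_verified: int     # APUs the verifier accepted
    n_matched: int      # reference APUs covered by a verified APU
    correctness: float  # n_verified / n_units (assumed definition)
    coverage: float     # n_matched / number of reference APUs (assumed definition)


def evaluate_caption(
    caption: str,
    audio: Any,
    reference_units: Sequence[str],
    decompose: Callable[[str], List[str]],
    verify: Callable[[str, Any], bool],
    matches: Callable[[str, str], bool],
) -> EvalResult:
    # Step 1: split the caption into atomic, self-contained statements.
    units = decompose(caption)
    # Step 2: keep only the statements the verifier confirms against the audio.
    verified = [u for u in units if verify(u, audio)]
    # Step 3: count reference statements covered by at least one verified unit.
    matched = [r for r in reference_units if any(matches(u, r) for u in verified)]
    correctness = len(verified) / len(units) if units else 0.0
    coverage = len(matched) / len(reference_units) if reference_units else 0.0
    return EvalResult(len(units), len(verified), len(matched), correctness, coverage)


# Toy stand-ins (hypothetical): a real system would call an LLM and an ALM here.
def toy_decompose(caption: str) -> List[str]:
    # Treat every sentence as one atomic unit.
    return [s.strip() for s in caption.split(".") if s.strip()]


def toy_verify(unit: str, audio: Any) -> bool:
    # "audio" is a set of true statements standing in for the raw signal.
    return unit.lower() in audio


def toy_matches(unit: str, reference: str) -> bool:
    # Case-insensitive exact match instead of semantic matching.
    return unit.lower() == reference.lower()


toy_audio = {"the speaker sounds angry", "the pitch is high", "the speech is fast"}
toy_reference = [
    "The speaker sounds angry",
    "The pitch is high",
    "The speech is fast",
    "The voice is loud",
]
toy_caption = "The speaker sounds angry. The pitch is high. The speech is slow."

result = evaluate_caption(
    toy_caption, toy_audio, toy_reference, toy_decompose, toy_verify, toy_matches
)
print(result)
```

In this toy run, two of the three caption statements are accepted by the verifier, and those two cover two of the four reference statements. Scoring per unit rather than per caption means a longer caption is not penalized for its length alone, only for statements that fail verification.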