Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Jiatong Shi, Jionghao Han, Yichen Lu, Santiago Pascual, Pengfei Wu, Chenye Cui, Shinji Watanabe, Chao Weng, Cong Zhou

TL;DR

Speech-DRAME tackles the challenge of evaluating speech-based role-play with a dual evaluation framework that combines Archetype (top-down) and Realism (bottom-up) perspectives. The framework is anchored by the human-annotated DRAME-EvalBench and by DRAME-Eval, a fine-tuned speech evaluation model (SEM). Unlike prior pipelines that use audio large language models (ALLMs) as judges, it emphasizes aligning the evaluation model with human ratings and grounding it in real-world speech, which enables robust benchmarking of speech foundation models through DRAME-RoleBench. Empirical results show that DRAME-Eval outperforms zero-shot and few-shot ALLMs (e.g., average correlation with human ratings of up to 0.629 for archetype and up to 0.625 for realism), while tests on real recordings reveal persistent gaps (approx. 0.247 with Instruct-LoRA), underscoring the need for diverse, realistic data. Collectively, Speech-DRAME provides a reproducible foundation for assessing and improving spoken role-play systems, guiding both benchmark construction and model development toward human-aligned evaluation.

Abstract

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
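
The agreement figures above are Pearson correlations between automatic judge scores and human ratings. As a minimal sketch of that metric, assuming per-sample scores on a shared numeric scale, the snippet below computes the coefficient in plain Python. The function name `pearson` and the example score lists are hypothetical illustrations and are not taken from the paper or its data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five samples (illustrative only, not paper data).
human_ratings = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]

r = pearson(human_ratings, judge_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the judge ranks and spaces samples much as the human annotators do; a value near 0 means little linear agreement.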

Paper Structure

This paper contains 76 sections, 7 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Speech-DRAME formalizes speech role-play with a dual evaluation strategy: Archetype Evaluation (top-down, stereotype-driven) and Realism Evaluation (bottom-up, human-grounded). Built on these, DRAME-EvalBench provides human-annotated data, DRAME-Eval aligns SEMs to perception, and DRAME-RoleBench benchmarks SFMs with automatic judges.
  • Figure 2: Annotation platform for Archetype Evaluation. Subfigure (a) presents the system’s audio visualization and playback interface, while subfigure (b) shows the structured annotation form for rating and reasoning across key perceptual dimensions.
  • Figure 3: System interface for Realism Evaluation. Annotators can visualize spectrograms and waveforms while reviewing contextual metadata such as the scene description, character style, and profile to guide their evaluation.
  • Figure 4: Annotation interface for Realism Evaluation. The form guides annotators through fine-grained perceptual judgments, connecting low-level speech delivery features with high-level narrative and character consistency.
  • Figure 5: Model rankings within domains for the Appropriateness dimension of the Mandarin Archetype Evaluation (higher is better). Each bar shows the mean score for that domain.
  • ...and 15 more figures