Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Jiatong Shi; Jionghao Han; Yichen Lu; Santiago Pascual; Pengfei Wu; Chenye Cui; Shinji Watanabe; Chao Weng; Cong Zhou

TL;DR

Speech-DRAME addresses the evaluation of speech-based role-play with a dual framework that combines Archetype (top-down) and Realism (bottom-up) perspectives. It is anchored by the human-annotated Speech-DRAME-EvalBench and by DRAME-Eval, a fine-tuned speech evaluation model (SEM). Unlike prior pipelines that use audio large language models (ALLMs) as judges, it emphasizes aligning the evaluation model with human ratings and grounding evaluation in real-world speech, and it benchmarks speech foundation models through Speech-DRAME-RoleBench. Empirically, DRAME-Eval outperforms zero-shot and few-shot ALLMs (average correlation with human ratings up to 0.629 for archetype and 0.625 for realism), while tests on real recordings reveal persistent gaps (approx. 0.247 with Instruct-LoRA), underscoring the need for diverse, realistic data. Together, these components provide a reproducible foundation for assessing and improving spoken role-play systems, guiding both benchmark construction and model development toward human-aligned evaluation.

Abstract

Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation rising from 0.480 to 0.629 for archetype evaluation and from 0.390 to 0.625 for realism evaluation). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
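The agreement figures above are Pearson correlations between judge scores and human ratings. As a minimal sketch of how such a number is computed (not the authors' evaluation code), the snippet below implements the standard Pearson formula in plain Python. The function name `pearson` and the two score lists are illustrative assumptions; the scores are made-up toy values, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): human ratings and automatic judge scores
# for the same five role-play responses.
human_ratings = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]

r = pearson(human_ratings, judge_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the judge ranks and spaces responses much as human annotators do; a value near 0 means its scores carry little linear information about the human ratings.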
