StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

Haishu Zhao; Aokai Hao; Yuan Ge; Zhenqiang Hong; Tong Xiao; Jingbo Zhu

Abstract

Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. To offer a more realistic interactive experience with customized styles, current SLMs can interpret and control speaking style intensity from user prompts during a dialogue. However, there remains a lack of systematic benchmarks that quantify and evaluate style intensity control in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating style intensity control across four dimensions: emotion, speed, volume, and pitch. Our results reveal performance gaps between leading SLMs and omni language models (OLMs), and point to underlying reasons and promising approaches for future exploration.

Paper Structure (13 sections, 1 equation, 2 figures, 4 tables)

Introduction
Benchmark Dataset Construction
Data Samples
Textual Content
Speech Synthesis
Evaluations & Discussions
Style Quantification Metrics
Experimental Setup
Main Results
Discussions
Impact of Data Training
Impact of Speech Tokenizers
Conclusion

Figures (2)

Figure 1: Conversational Speaking Style Controlling Dialogue
Figure 2: An overview of the data composition and synthesis process: textual content is generated with different situational features (emotional or neutral), after which neutral prompts and stylistic responses are synthesized using CosyVoice2. Emotional responses are synthesized using RAVDESS as the reference audio, while the others are adjusted using FFmpeg.
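
The Figure 2 caption says that non-emotional styles (speed, volume, pitch) are adjusted with FFmpeg but does not give the commands. The following is a minimal sketch of how such adjustments could be expressed, not the authors' pipeline. The function name, the 24 kHz sample rate, and the conservative 0.5 to 2.0 tempo range are assumptions made for illustration; the FFmpeg audio filters used (`atempo`, `volume`, `asetrate`, `aresample`) are standard.

```python
from typing import List


def build_ffmpeg_style_command(
    src: str,
    dst: str,
    speed: float = 1.0,
    volume: float = 1.0,
    pitch: float = 1.0,
    sample_rate: int = 24000,  # assumed input sample rate, not from the paper
) -> List[str]:
    """Build (but do not run) an ffmpeg command that rescales speed, volume, and pitch.

    Hypothetical helper: all factors are multiplicative, with 1.0 meaning "unchanged".
    """
    # Conservative range check: older FFmpeg builds limit atempo to [0.5, 2.0].
    for name, value in (("speed", speed), ("pitch", pitch)):
        if not 0.5 <= value <= 2.0:
            raise ValueError(f"{name} factor must be within [0.5, 2.0], got {value}")
    if volume < 0:
        raise ValueError("volume factor must be non-negative")

    filters: List[str] = []
    if pitch != 1.0:
        # Reinterpret the sample rate to shift pitch, resample back to the
        # original rate, then undo the resulting duration change with atempo.
        filters.append(f"asetrate={int(round(sample_rate * pitch))}")
        filters.append(f"aresample={sample_rate}")
        filters.append(f"atempo={1.0 / pitch:.4f}")
    if speed != 1.0:
        # atempo changes tempo without changing pitch.
        filters.append(f"atempo={speed:.4f}")
    if volume != 1.0:
        # volume applies a linear gain factor.
        filters.append(f"volume={volume:.4f}")

    cmd = ["ffmpeg", "-y", "-i", src]
    if filters:
        cmd += ["-filter:a", ",".join(filters)]
    cmd.append(dst)
    return cmd


if __name__ == "__main__":
    # Example: a response spoken 1.5x faster and at half volume.
    print(" ".join(build_ffmpeg_style_command("in.wav", "out.wav", speed=1.5, volume=0.5)))
```

The returned list can be passed to `subprocess.run` on a machine with FFmpeg installed. Building the command separately from running it keeps the intensity factors easy to inspect and log for each benchmark sample.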
