SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Junyi Ao; Yuancheng Wang; Xiaohai Tian; Dekun Chen; Jun Zhang; Lu Lu; Yuxuan Wang; Haizhou Li; Zhizheng Wu

TL;DR

SD-Eval addresses a gap in the evaluation of spoken-dialogue systems by building a dedicated benchmark around paralinguistic and environmental information. It introduces four test sub-tasks (emotion, accent, age, and environment) drawn from eight datasets, and builds a large, diverse training set from eleven sources, enabling open-source evaluation. The study compares cascaded ASR+LLM systems, end-to-end speech LLMs, and upper-bound baselines using objective, subjective, and LLM-based metrics. Models conditioned on speech cues perform better, and LLM-based judgments align more closely with human ratings than traditional metrics do. The results underscore the importance of non-content speech cues for realistic dialogue responses, position LLM-based evaluation as a promising tool for open-ended speech-to-text generation, and outline paths toward future multi-turn benchmarks that cover a broader range of cues.

Abstract

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that of SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation of the generated responses using objective evaluation methods (e.g., BLEU and ROUGE), subjective evaluations, and LLM-based metrics. Models conditioned on paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation than traditional metrics do. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.
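As a rough sanity check on the sizes quoted in the abstract, the average utterance length can be derived from the reported hours and utterance counts. The sketch below is illustrative only: the helper name `mean_utterance_seconds` is hypothetical, and "724.4k" is read as 724,400 utterances, which is an assumption about rounding.

```python
def mean_utterance_seconds(total_hours: float, num_utterances: int) -> float:
    """Average utterance length in seconds from total hours and utterance count."""
    if num_utterances <= 0:
        raise ValueError("num_utterances must be positive")
    return total_hours * 3600.0 / num_utterances


# Figures quoted in the abstract.
test_avg = mean_utterance_seconds(8.76, 7_303)
# "724.4k" is interpreted as 724,400 utterances (assumption).
train_avg = mean_utterance_seconds(1_052.72, 724_400)

print(f"SD-Eval test set: {test_avg:.2f} s per utterance")
print(f"Training set:     {train_avg:.2f} s per utterance")
```

Under these assumptions the test utterances average a little over 4 seconds and the training utterances a little over 5 seconds, so both sets consist of short, single-utterance inputs.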

Paper Structure (44 sections, 18 figures, 5 tables)

This paper contains 44 sections, 18 figures, 5 tables.

Introduction
Related Work
Spoken Conversation Datasets with Paralinguistic Label
Spoken Question Answering
Evaluation Metrics for Open-Ended Generation Tasks
SD-Eval Benchmark Dataset
Dataset Construction
Data Collection
Synthetic Data Generation
Label Normalization
Data Filtering
Punctuation Restoration
Response Generation
Dataset Statistics
Benchmark Experiments
...and 29 more sections

Figures (18)

Figure 1: (a) Information embedded in speech: content, environmental, and paralinguistic information. (b) Examples of spoken dialogue, which illustrate the impact of user emotions, accents, age, and environmental information on the responses.
Figure 2: The prompt for filtering utterances.
Figure 3: The prompt for generating responses of utterances related to emotion.
Figure 4: Pie charts illustrating the data distribution for each category within each subset.
Figure 5: (a) Model structure of Cascade LLM, which generates a text response directly from the ASR output. (b) Model structure of Vanilla Speech LLM (VS-LLM). The LLM takes as input a speech representation generated by a speech encoder and adaptor.
...and 13 more figures
