C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma, Wei Tao, Yiwen Guo

TL;DR

This paper introduces $C^3$, a bilingual benchmark for evaluating spoken dialogue systems on complex conversations, addressing five phenomena: phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction. It builds $Cdata$, a bilingual corpus of 1,079 English/Chinese instances with 1,586 audio-text pairs, and pairs it with an automatic LLM-based evaluation method that aligns with human judgments. Ten end-to-end spoken dialogue systems are assessed, revealing notable cross-language differences (English generally easier than Chinese) and phenomenon-specific difficulties, with omission and semantic ambiguity posing particular challenges in Chinese. The work provides a practical, language-aware framework for assessing and guiding the development of more robust, cross-linguistic spoken dialogue technologies and outlines future expansion to additional languages.

Abstract

Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, little research has comprehensively examined how effectively they comprehend and emulate human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

Paper Structure

This paper contains 39 sections, 13 figures, 9 tables.

Figures (13)

  • Figure 1: The structure and exemplars within the dataset. The subplots correspond to the sub-datasets of five phenomena. The blue boxes enclose the input for SDM, with some parts of the prompts omitted, while the corresponding outputs are within dashed boxes. Blue underlined text indicates the focal elements of interest, and gray text represents a segment of the prompt. The arrow indicates a rising or falling intonation. The (?) denotes an omitted sentence component. The >> points to the referent of the pronoun. The $...$ represents the omitted dialogue.
  • Figure 2: The relation between terms in Section 3.1.1.
  • Figure 3: The structure of the data instance. The blue box contains input data in text and audio format, where blue text is the prompt and black text is the dialogue content being questioned. The dashed box contains the reference output, with the underlined portion highlighting the key element. "[PAUSE]" represents the pause in the audio.
  • Figure 4: Radar charts depicting the accuracies of each SDM on the English subset of Cdata.
  • Figure 5: Radar charts depicting the accuracies of each SDM on the Chinese subset of Cdata.
  • ...and 8 more figures