Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Shinji Watanabe, Hung-yi Lee

TL;DR

Full-Duplex-Bench v1.5 addresses overlap handling in real-time, full-duplex speech systems with a fully automated, scenario-controlled benchmark that streams user audio to both open-source and API-based models. It evaluates responses on stop latency $t_{\text{stop}}$ and response latency $t_{\text{resp}}$, as well as on semantic behavior and prosodic adaptation. It introduces four overlap scenarios (Interruption, Backchannel, Talking to Others, and Background Speech) to elicit distinct strategies, revealing a trade-off between rapid yielding and floor-holding across five agents. The framework supports reproducible evaluation and provides open-source tasks, metrics, and code to accelerate the development of fluid, socially aware full-duplex dialogue systems. Key findings: fast yielders handle true interruptions well but may mismanage incidental speech, while robust floor holders delay repairs; addressee discrimination and backchannel filtering emerge as critical differentiators among models. Overall, the benchmark offers a practical, reusable tool for diagnosing overlap competence and guiding progress beyond half-duplex paradigms toward more natural, real-time human-machine conversations.

Abstract

Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.
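
The abstract names stop latency and response latency as timing metrics but does not define them here. As a rough illustration only, the sketch below assumes that stop latency is the time from the onset of the user's overlapping speech until the model's ongoing speech segment ends, and that response latency is the time from the end of the user's overlapping speech until the model's next speech onset. These definitions, the function names `stop_latency` and `response_latency`, and the segment representation are assumptions for this example, not the benchmark's actual code.

```python
from typing import Optional, Sequence, Tuple

# A speech segment as (start_seconds, end_seconds). This representation is
# an assumption for illustration, e.g. the output of a voice activity detector.
Segment = Tuple[float, float]


def stop_latency(model_segments: Sequence[Segment],
                 overlap_start: float) -> Optional[float]:
    """Assumed definition: time from user-overlap onset until the model
    segment that was active at that moment ends."""
    for start, end in model_segments:
        if start <= overlap_start < end:
            return end - overlap_start
    return None  # the model was not speaking when the overlap began


def response_latency(model_segments: Sequence[Segment],
                     overlap_end: float) -> Optional[float]:
    """Assumed definition: time from the end of the user's overlapping
    speech until the model's next speech onset."""
    onsets = [start for start, _ in model_segments if start >= overlap_end]
    if not onsets:
        return None  # no new model onset after the overlap (e.g. it never yielded)
    return min(onsets) - overlap_end


if __name__ == "__main__":
    # Hypothetical timeline: the model talks, the user overlaps from 4.0 s
    # to 6.5 s, the model stops at 5.2 s and starts a new turn at 7.0 s.
    model = [(0.0, 5.2), (7.0, 10.0)]
    print("t_stop:", stop_latency(model, overlap_start=4.0))
    print("t_resp:", response_latency(model, overlap_end=6.5))
```

Under these assumed definitions, a model that keeps the floor through the whole overlap has a long stop latency and no response latency, which is one simple way the responsive and floor-holding strategies described above could show up in timing data.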

Paper Structure

This paper contains 22 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the evaluation framework for full-duplex speech models. User speech (top) overlaps with model output (bottom) in four controlled scenarios. We analyze the model's post-overlap response across three dimensions: categorical behaviors, interaction timing, and adaptive speech features.
  • Figure 2: Illustration of the four controlled overlap scenarios. User speech (top) and model speech (bottom) share a timeline: (1) Interruption: user barges in with a new request; (2) Backchannel: brief acknowledgment (e.g., "uh-huh"); (3) Talking to Others: user addresses someone else; (4) Background Speech: far-field third-party talk not meant for the model.