Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models
Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Shinji Watanabe, Hung-yi Lee
TL;DR
Full-Duplex-Bench v1.5 targets overlap handling in real-time, full-duplex speech systems. It is a fully automated, scenario-controlled benchmark that streams user audio to both open-source and API-based models and evaluates their responses on stop latency ($t_{\text{stop}}$), response latency ($t_{\text{resp}}$), semantic behavior, and prosodic adaptation. Four overlap scenarios (Interruption, Backchannel, Talking to Others, and Background Speech) elicit distinct strategies and reveal a trade-off between rapid yielding and floor-holding across five agents. Fast yielders handle true interruptions well but may mismanage incidental speech, whereas robust floor holders delay repairs; addressee discrimination and backchannel filtering emerge as the key differentiators among models. The tasks, metrics, and code are open source to support reproducible evaluation, making the benchmark a practical, reusable tool for diagnosing overlap competence and guiding progress beyond half-duplex paradigms toward more natural, real-time human-machine conversation.
Abstract
Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.
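To make the stop and response latency metrics concrete, the following is a minimal sketch of how they could be computed from event timestamps. The definitions used here are assumptions for illustration, not taken from the benchmark's code: stop latency is measured from the onset of the overlapping speech to the moment the model stops talking, and response latency from the end of the overlapping speech to the start of the model's next utterance. The `OverlapEvent` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OverlapEvent:
    """Timestamps (in seconds) for one overlap episode. Hypothetical structure."""
    overlap_start: float            # overlapping speech begins
    overlap_end: float              # overlapping speech ends
    model_stop: Optional[float]     # model stops talking; None if it never yields
    model_resume: Optional[float]   # model starts its next utterance; None if it stays silent


def stop_latency(ev: OverlapEvent) -> Optional[float]:
    """Assumed definition: time from overlap onset until the model stops speaking."""
    if ev.model_stop is None or ev.model_stop < ev.overlap_start:
        return None  # the model held the floor, or had already stopped before the overlap
    return ev.model_stop - ev.overlap_start


def response_latency(ev: OverlapEvent) -> Optional[float]:
    """Assumed definition: time from the end of the overlap until the model speaks again."""
    if ev.model_resume is None:
        return None  # the model never produced a follow-up
    return ev.model_resume - ev.overlap_end


# Example: the user interrupts from 2.0 s to 3.5 s, the model yields at 2.5 s
# and answers at 4.25 s.
event = OverlapEvent(overlap_start=2.0, overlap_end=3.5, model_stop=2.5, model_resume=4.25)
print(stop_latency(event), response_latency(event))
```

Returning `None` rather than a sentinel number keeps floor-holding cases (where the model never stops) out of latency averages, which matters when comparing responsive and floor-holding agents.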
