MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

He Zhang; Wenqian Cui; Haoning Xu; Xiaohui Li; Lei Zhu; Shaohua Ma; Irwin King

TL;DR

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping speech interactions, but existing benchmarks largely assess single-round exchanges and neglect multi-round dynamics. MTR-DuplexBench introduces a turn-segmentation-based framework that enables turn-by-turn evaluation of dialogue quality, conversational features, instruction following, and safety in multi-round settings. A GPT-4o-assisted turn segmentation pipeline, majority voting, and a dedicated assistant-response window address blurred turn boundaries and context drift. Experiments with Moshi reveal noticeable performance degradation over multiple rounds, highlighting the need for robust, multi-dimensional evaluation to guide future development of FD-SLMs.

Abstract

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
