Semantic-Aware Interruption Detection in Spoken Dialogue Systems: Benchmark, Metric, and Model

Kangxiang Xia; Bingshen Mu; Xian Shi; Jin Xu; Lei Xie

Abstract

Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end-to-end models that exhibit unacceptable response delays. Moreover, the absence of real-world benchmarks and holistic metrics hinders progress in the field. This paper presents a comprehensive framework to overcome these limitations. We first introduce SID-Bench, the first benchmark for semantic-aware interruption detection built entirely from real-world human dialogues. To provide a rigorous assessment of the responsiveness-robustness trade-off, we propose the Average Penalty Time (APT) metric, which assigns a temporal cost to both false alarms and late responses. Building on this framework, we design an LLM-based detection model optimized through a novel training paradigm to capture subtle semantic cues of intent. Experimental results show that our model significantly outperforms mainstream baselines, achieving a nearly threefold reduction in APT. By resolving the long-standing tension between speed and stability, our work establishes a new state of the art for intelligent interruption handling in SDS. To facilitate future research, SID-Bench and the associated code are available at: https://github.com/xkx-hub/SID-bench.

Paper Structure (18 sections, 3 equations, 3 figures, 3 tables)

Introduction
SID-Bench
Data Collection and Sources
Definition and Annotation of Interruption Events
Statistics of SID-Bench
Evaluation Metrics
Primary Metrics: FIR and IRL
Composite Score: APT
Methodology
Model Architecture
Training Paradigm
Inference Process
Experiments
Baselines
Evaluation
...and 3 more sections

Figures (3)

Figure 1: The semi-automated annotation pipeline for SID-Bench. The process begins with raw audio and its transcription. In the Annotation stage, the audio is processed by the Kaldi toolkit for forced alignment to obtain word-level timestamps, while the text is analyzed by a series of LLMs to semantically identify the interruption point, marked with a <break> tag. In the final Information Fusion stage, the semantic <break> tag is aligned with the precise start time of the corresponding word from Kaldi, establishing a semantically meaningful and temporally accurate ground-truth Break_point.
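The Information Fusion step in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `find_break_point`, the `(word, start, end)` alignment format, the whitespace tokenization, and the reading of "corresponding word" as the first word after the tag are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

BREAK_TAG = "<break>"

# Hypothetical alignment entry: (word, start_seconds, end_seconds),
# one entry per transcript word, in spoken order.
Alignment = List[Tuple[str, float, float]]


def find_break_point(tagged_text: str, alignment: Alignment) -> Optional[float]:
    """Map a <break> tag in the transcript to a timestamp.

    Assumes the tag is a whitespace-separated token and that the break point
    is the start time of the first word following the tag.
    Returns None when the transcript carries no tag.
    """
    tokens = tagged_text.split()
    if BREAK_TAG not in tokens:
        return None
    # Number of words before the tag == index of the word right after it.
    word_index = tokens.index(BREAK_TAG)
    if word_index >= len(alignment):
        raise ValueError("<break> tag has no following word to align with")
    return alignment[word_index][1]


alignment = [
    ("uh", 0.0, 0.2),
    ("huh", 0.2, 0.5),
    ("wait", 1.1, 1.4),
    ("stop", 1.4, 1.8),
]
break_point = find_break_point("uh huh <break> wait stop", alignment)
```

In this toy example the backchannel "uh huh" precedes the tag, so the break point falls on the start time of "wait".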
Figure 2: Illustration of the four evaluation scenarios and their associated time penalties. The user's utterance contains a ground-truth interruption intent marked by <break>. (a) True Positive: The system correctly stops after the <break> point. The penalty is IRL, shown in blue. (b) False Positive: The system incorrectly stops in response to a backchannel before the <break>. This is a catastrophic failure, and the penalty, shown in red, applies to the entire turn's duration. (c) False Negative: The system fails to stop, forcing the user to listen to superfluous speech. The penalty, shown in red, is the duration of this unwanted audio. (d) True Negative: The system correctly ignores a backchannel and continues speaking when no <break> is present, thus incurring zero penalty.
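The four cases in Figure 2 can be expressed as a small penalty function. The sketch below is a hedged illustration, not the paper's definition of APT: the `Turn` fields are hypothetical, the false-negative penalty is assumed to run from the <break> point to the end of the turn, and APT is assumed to be the plain mean of per-turn penalties.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Turn:
    duration: float              # length of the turn in seconds
    break_time: Optional[float]  # ground-truth <break> time; None if no interruption intent
    stop_time: Optional[float]   # time the system stopped; None if it kept speaking


def penalty(turn: Turn) -> float:
    """Time penalty for one turn, following the four cases of Figure 2."""
    if turn.stop_time is None:
        if turn.break_time is None:
            return 0.0  # (d) true negative: no penalty
        # (c) false negative: unwanted audio, assumed to last until the turn ends
        return turn.duration - turn.break_time
    if turn.break_time is None or turn.stop_time < turn.break_time:
        return turn.duration  # (b) false positive: whole turn is penalized
    return turn.stop_time - turn.break_time  # (a) true positive: response latency (IRL)


def average_penalty_time(turns: Sequence[Turn]) -> float:
    """Assumed aggregation: mean penalty over all evaluated turns."""
    if not turns:
        raise ValueError("need at least one turn")
    return sum(penalty(t) for t in turns) / len(turns)


turns = [
    Turn(duration=10.0, break_time=4.0, stop_time=4.5),    # (a)
    Turn(duration=10.0, break_time=4.0, stop_time=2.0),    # (b)
    Turn(duration=10.0, break_time=4.0, stop_time=None),   # (c)
    Turn(duration=10.0, break_time=None, stop_time=None),  # (d)
]
apt = average_penalty_time(turns)
```

Under these assumptions a single early stop costs as much as the whole turn, which is why a trigger-happy detector scores poorly even when its latency on true interruptions is low.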
Figure 3: The overall architecture of the proposed SID-model for real-time interruption detection. (1) During training, we use audio with a pre-labeled user interruption point. This audio is randomly cropped into clips of varying lengths, represented by different colors. Each clip is assigned a ground-truth label: 'Y' (interrupt) if its endpoint is after the user interruption point, and 'N' otherwise. (2) The audio clips are processed by an Audio Encoder (AuT) to extract features. (3) The resulting audio frame sequences are fed into an LLM (Qwen3-0.6B). (4) The model sequentially predicts whether to issue an interrupt signal or not, enabling the system to stop its speech in a timely manner. The user interruption points are annotated following the procedure of SID-Bench.
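The labeling rule in step (1) of Figure 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' training code: the helper names, the `min_len` parameter, and the choice to crop prefixes that all start at the beginning of the audio are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def label_clip(clip_end: float, break_time: Optional[float]) -> str:
    """'Y' if the clip endpoint lies after the interruption point, else 'N'."""
    if break_time is not None and clip_end > break_time:
        return "Y"
    return "N"


def sample_clips(
    duration: float,
    break_time: Optional[float],
    n_clips: int,
    min_len: float = 0.5,
    seed: Optional[int] = None,
) -> List[Tuple[float, str]]:
    """Draw random clip endpoints and label each one.

    Assumption: every clip starts at time 0 and ends at a random point,
    mimicking what a streaming detector has heard so far.
    """
    if min_len > duration:
        raise ValueError("min_len must not exceed the audio duration")
    rng = random.Random(seed)
    clips = []
    for _ in range(n_clips):
        clip_end = rng.uniform(min_len, duration)
        clips.append((clip_end, label_clip(clip_end, break_time)))
    return clips


clips = sample_clips(duration=10.0, break_time=4.0, n_clips=50, seed=0)
```

Each returned pair holds a clip endpoint in seconds and its label, so one annotated utterance yields many training examples on both sides of the interruption point.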
