Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
TL;DR
Phoenix-VAD is an LLM-based, plug-and-play module for streaming semantic endpoint detection in full-duplex speech systems. It combines a Zipformer audio encoder, a modality-bridging adapter, and a backbone LLM, and is trained with a sliding window strategy so that it can emit semantic state decisions in real time without modifying the downstream dialogue model. The training data is synthetic: semantically complete and incomplete text is generated and paired with speech, and each sample carries a precise stop-speech timestamp and an outcome label. Experiments show competitive performance on both semantically complete and incomplete scenarios with real-time chunk-level inference (~50 ms), and ablations support design choices such as chunk size and training strategy. Because the module is decoupled from the dialogue model, it can be integrated into full-duplex human-computer interaction systems and optimized independently.
Abstract
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, which hinders seamless audio interaction. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves competitive performance. Furthermore, this design allows the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
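To make the streaming idea concrete, the following is a minimal sketch of chunk-level inference over a sliding window. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window length, the "continue"/"stop" labels, and the toy energy-based classifier are all hypothetical. In Phoenix-VAD the decision would come from the audio encoder, adapter, and LLM rather than from the stand-in rule used here.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Chunk = List[float]  # placeholder for one chunk of audio features


def stream_endpoint_decisions(
    chunks: Iterable[Chunk],
    classify: Callable[[List[Chunk]], str],
    window_chunks: int = 3,  # hypothetical window length, not from the paper
) -> Iterator[Tuple[int, str]]:
    """Yield one (chunk_index, decision) pair per incoming chunk.

    Only the most recent `window_chunks` chunks are kept, so the cost of
    each decision is bounded regardless of how long the stream runs.
    """
    window: Deque[Chunk] = deque(maxlen=window_chunks)
    for index, chunk in enumerate(chunks):
        window.append(chunk)  # the oldest chunk is dropped automatically
        yield index, classify(list(window))


def toy_classify(window: List[Chunk]) -> str:
    """Stand-in for the model: 'stop' once the last two chunks are quiet.

    This is an energy rule used only to make the sketch runnable; it does
    not perform any semantic analysis.
    """
    if len(window) < 2:
        return "continue"
    quiet = all(max(abs(x) for x in chunk) < 0.2 for chunk in window[-2:])
    return "stop" if quiet else "continue"


# Five single-value chunks: two loud, then three quiet.
stream = [[0.9], [0.8], [0.1], [0.0], [0.0]]
decisions = list(stream_endpoint_decisions(stream, toy_classify))
print(decisions)
```

Passing the classifier in as a callable mirrors the plug-and-play design described above: the windowing loop stays the same when the decision function is swapped or retrained, and nothing in it depends on the downstream dialogue model.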
