Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

Hyunjong Ok, Suho Yoo, Jaeho Lee

TL;DR

This work tackles end-turn detection (ETD) in LLM-powered spoken dialogue systems, where distinguishing a completed turn from a hesitation is crucial for natural interaction. It introduces the ETD Dataset, the first public dataset for the task, which combines synthetic TTS-generated speech with real-world conversations to support training and evaluation. It then proposes SpeculativeETD, a collaborative two-stage framework: a lightweight on-device GRU quickly separates speaking units (SU) from non-speaking units, and a server-side Wav2vec 2.0 model makes the finer-grained Gap (turn end) vs. Pause (hesitation) decision, so the system runs in real time with large reductions in computation. Experiments show that SpeculativeETD can approach the accuracy of a full Wav2vec 2.0 system while delivering substantial FLOPs savings and sub-millisecond on-device latency, enabling practical deployment in resource-constrained settings. The work further provides open-source data and a reproducible methodology to advance ETD research in real-world spoken dialogue applications.

Abstract

Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD), the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD Dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects non-speaking units in real time on local devices, and a high-performance Wav2vec-based model running on the server, which handles the more challenging task of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.

Paper Structure

This paper contains 29 sections, 1 equation, 3 figures, 15 tables.

Figures (3)

  • Figure 1: An example of a failure case of end-turn detection during a voice chat with GPT-4o.
  • Figure 2: An overview of our data generation methodology. (a) illustrates the synthetic data pipeline, where text-based dialogue data is converted into three types of speech variations using a text-to-speech (TTS) system. (b) illustrates the real data processing pipeline, which collects and processes speech data from online sources.
  • Figure 3: An illustration of our SpeculativeETD method. A lightweight model (a 1M-parameter GRU in this example) operates as the on-device model, enabling real-time processing. A high-performance model (a 94M-parameter Wav2vec 2.0 base model in this case) serves as the server-side model, verifying the predictions of the on-device model.
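
The collaborative scheme that Figure 3 illustrates can be sketched in a few lines of Python. This is only a minimal illustration of the control flow described in the abstract, not the authors' implementation: the function `speculative_etd`, the energy-based `toy_on_device` detector, and the `CountingServer` class are hypothetical stand-ins for the GRU and the Wav2vec 2.0 model, and the thresholds are arbitrary. The point it shows is that the expensive server-side model is only consulted for frames the cheap on-device model flags as non-speaking.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # one short chunk of audio samples


def speculative_etd(
    frames: List[Frame],
    on_device: Callable[[Frame], str],
    server: Callable[[List[Frame]], str],
) -> List[str]:
    """Label every frame as "SU", "Pause", or "Gap".

    The cheap on-device model runs on every frame. The server model is
    called only when the on-device model reports a non-speaking unit.
    """
    labels: List[str] = []
    history: List[Frame] = []
    for frame in frames:
        history.append(frame)
        if on_device(frame) == "SU":
            labels.append("SU")
        else:
            # Escalate: let the larger model decide Gap vs. Pause.
            labels.append(server(history))
    return labels


def toy_on_device(frame: Frame, threshold: float = 0.1) -> str:
    """Hypothetical stand-in for the GRU: a mean-energy threshold."""
    energy = sum(x * x for x in frame) / len(frame)
    return "SU" if energy > threshold else "non-SU"


class CountingServer:
    """Hypothetical stand-in for the Wav2vec 2.0 model.

    It calls a silence a "Gap" once it has lasted `gap_after` frames,
    and counts how often it is invoked.
    """

    def __init__(self, gap_after: int = 3) -> None:
        self.gap_after = gap_after
        self.calls = 0

    def __call__(self, history: List[Frame]) -> str:
        self.calls += 1
        silent = 0
        for frame in reversed(history):
            if toy_on_device(frame) == "SU":
                break
            silent += 1
        return "Gap" if silent >= self.gap_after else "Pause"


loud, quiet = [0.5] * 4, [0.0] * 4
frames = [loud, loud, quiet, loud, quiet, quiet, quiet]
server = CountingServer(gap_after=3)
labels = speculative_etd(frames, toy_on_device, server)
print(labels)
print(f"server calls: {server.calls} of {len(frames)} frames")
```

In this toy run the server stand-in is invoked for four of the seven frames, which mirrors, in miniature, how skipping the large model on speaking frames reduces computation. A real system would replace both stand-ins with trained models and would also need to handle network latency, which this sketch ignores.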