Two-pass Endpoint Detection for Speech Recognition

Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran, Roland Maas, Ariya Rastrow

TL;DR

This work tackles endpoint detection in far-field ASR by introducing a two-pass solution: a first-pass endpoint detector followed by a second-pass EP Arbitrator that gates the initial decision using segment-level acoustic and recognition cues. The Arbitrator computes a postulated EOS probability from AcousticEncoder and TextEncoder embeddings and a lightweight neural classifier, enabling low-latency verification and selective delaying of endpoints. Across public (SLURP) and in-house datasets (transactional, partial, and conversational), the approach improves early endpoint rate and, in some cases, WER, while maintaining or only modestly increasing latency, demonstrating generalization to unseen domains and compatibility with multiple first-pass detectors. The method yields notable gains in EEPR and WER on transactional data, modest EEPR improvements on conversational data, and robust EEPR improvements on SLURP, supporting practical adoption in voice-assistant systems.

Abstract

Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the number of cases in which users are cut off early. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed the EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method yields improvements regardless of the first-pass EP model used.
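
The two-pass gating described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `two_pass_endpoint`, the per-frame score list, the `arbitrator` callable, and both thresholds are hypothetical, and the sketch assumes that a rejected candidate simply lets the first pass keep running until it proposes another endpoint.

```python
from typing import Callable, Optional, Sequence


def two_pass_endpoint(
    first_pass_scores: Sequence[float],
    arbitrator: Callable[[int], float],
    first_pass_threshold: float = 0.5,  # hypothetical value
    arbitrator_threshold: float = 0.5,  # hypothetical value
) -> Optional[int]:
    """Return the frame index of the first verified endpoint, or None.

    first_pass_scores: assumed per-frame end-of-speech scores from the
        first-pass detector.
    arbitrator: assumed callable that maps a candidate frame index to an
        end-of-speech probability computed from segment-level features.
    """
    for t, score in enumerate(first_pass_scores):
        # The first pass proposes a candidate endpoint only above its threshold.
        if score < first_pass_threshold:
            continue
        # The second pass either confirms the candidate or delays the endpoint.
        if arbitrator(t) >= arbitrator_threshold:
            return t
    return None


# Toy example: the candidate at frame 1 is rejected, the one at frame 3 is accepted.
scores = [0.1, 0.7, 0.2, 0.9]
toy_arbitrator = lambda t: {1: 0.3, 3: 0.8}.get(t, 0.0)
endpoint = two_pass_endpoint(scores, toy_arbitrator)
```

In this sketch the second pass runs only on frames where the first pass fires, which is one way to keep the added verification cost low.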

Paper Structure (20 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: A candidate endpointing decision triggered at frame $t$ by the first-pass EP detector is verified by the EP Arbitrator, which consumes segment-level features of the acoustics and the recognition output.
  • Figure 2: EEPRR% and WERR% vs. average latency on the transactional queries dataset. WERR and EEPRR are computed with respect to the baseline in Table \ref{tab:arbitrator-main}.
  • Figure 3: EEPRR% vs. average latency on the conversational test set. EEPRR is computed with respect to the conversational data baseline in Table \ref{tab:arbitrator-main}.
  • Figure 4: EEPRR% vs. average latency with different first-pass models on the transactional queries dataset. EEPRR is computed with respect to the baseline in Table \ref{tab:arbitrator-main}.
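
The captions report EEPRR% and WERR% relative to a baseline but do not spell out the formula. Assuming these denote relative reductions in the usual sense, (baseline - candidate) / baseline, a small sketch of the computation follows. The function name and the numbers are hypothetical and are not results from the paper.

```python
def relative_reduction_pct(baseline: float, candidate: float) -> float:
    """Relative reduction of a rate (e.g. EEPR or WER) versus a baseline, in percent.

    Assumed definition: 100 * (baseline - candidate) / baseline.
    Positive values mean the candidate system has the lower (better) rate.
    """
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical rates, for illustration only.
eeprr = relative_reduction_pct(baseline=10.0, candidate=8.0)
```

Under this definition, a drop in early endpoint rate from 10.0% to 8.0% corresponds to an EEPRR of 20%.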