Improving endpoint detection in end-to-end streaming ASR for conversational speech

Anandh C, Karthik Pandia Durai, Jeena Prakash, Manickavela Arumugam, Kadri Hacioglu, S. Pavankumar Dubagunta, Andreas Stolcke, Shankar Venkatesan, Aravind Ganapathiraju

TL;DR

The paper addresses endpoint detection latency and accuracy in streaming transducer-based ASR for conversational speech. It introduces a VAD network that uses either encoder embeddings (encNET) or Mel features (melNET) to estimate trailing silence robustly and trigger endpoints, along with an end-of-word (EOW) token and a delay-penalty loss to reduce word fragmentation and emission delay. Encoder-based VAD achieves a lower equal error rate (EER) than Mel-based VAD (0.105 vs. 0.182), and EOW with delay penalty reaches near-oracle WER (approximately 21.4%) under certain trailing-silence (TS) configurations, improving latency without sacrificing transcript quality. The results, obtained on Switchboard with a Zipformer transducer, indicate practical improvements in endpoint precision and latency for live conversational ASR systems.

Abstract

ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which can lead to errors or delays in EP. Inaccurate EP cuts the user off while speaking and returns an incomplete transcript, while delayed EP increases the perceived latency; both degrade the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection from an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.

Paper Structure

This paper contains 11 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Illustration of false EP, missed EP, and delayed EP while using blank symbols for endpointing. Here, 6 frames of blank tokens are required to trigger an endpoint.
  • Figure 2: Block diagram of proposed endpointing method
  • Figure 3: C1 and C2 are the possible endpointing cases where the joiner output is emitted before ($t_2$) or after ($t_3$) the VAD decision ($t_1+\delta$), respectively. Here $\delta$ is the trailing silence (after speech ended) required to trigger an endpoint. $t_1+\delta$ and $t_3$ are the endpoints of the proposed method; $t_1+\delta$ and $t_3+\delta$ are the endpoints of the baseline for the cases C1 and C2, respectively.
  • Figure 4: Detection error trade-off curves for encNET and melNET. The EERs for encNET and melNET are 0.105 and 0.182, respectively.
  • Figure 5: Latency vs WER for different systems for various trailing silence configurations (400ms, 600ms, and 800ms)
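
The endpointing rules described in the captions of Figures 1 and 3 can be illustrated with a minimal Python sketch. It is a reading of the captions only, not the authors' implementation: the function names, the `<blk>` symbol, the use of milliseconds, and the rule that an endpoint is considered only after at least one non-blank token are all assumptions made for illustration.

```python
from typing import Optional, Sequence

BLANK = "<blk>"  # hypothetical name for the transducer blank symbol


def blank_run_endpoint(tokens: Sequence[str], required_blanks: int = 6) -> Optional[int]:
    """Blank-based endpointing as in the Figure 1 caption.

    Returns the frame index at which `required_blanks` consecutive blank
    frames have been seen, or None if that never happens. Assumption: blanks
    before the first non-blank token are ignored.
    """
    run = 0
    seen_token = False
    for frame, token in enumerate(tokens):
        if token == BLANK:
            run += 1
        else:
            run = 0
            seen_token = True
        if seen_token and run == required_blanks:
            return frame
    return None


def proposed_endpoint(t1: int, delta: int, t_emit: int) -> int:
    """Endpoint of the proposed method per the Figure 3 caption.

    t1: time speech ended, delta: required trailing silence,
    t_emit: time of the final joiner emission (t2 in C1, t3 in C2).
    The endpoint is the later of the VAD decision and the emission.
    """
    return max(t1 + delta, t_emit)


def baseline_endpoint(t1: int, delta: int, t_emit: int) -> int:
    """Baseline endpoint per the Figure 3 caption.

    C1 (emission before the VAD decision): t1 + delta.
    C2 (emission after the VAD decision): t_emit + delta.
    """
    vad_time = t1 + delta
    return vad_time if t_emit <= vad_time else t_emit + delta


# Case C2 with illustrative values in milliseconds: speech ends at 1000,
# 400 of trailing silence is required, the last token is emitted at 1600.
saving = baseline_endpoint(1000, 400, 1600) - proposed_endpoint(1000, 400, 1600)
```

Under this reading, the two methods give the same endpoint in case C1, and the proposed method fires one trailing-silence interval earlier in case C2, which is where delayed emission would otherwise add latency.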