DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR

Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff

TL;DR

The paper identifies a persistent gap in unified ASR between streaming with limited past context and full-context non-streaming performance. It introduces DCTX-Conformer, which adds a dynamic contextual carry-over mechanism to a unified Conformer trained with dynamic chunking. Through a dynamic attention mask, the mechanism uses both the left context of a chunk and one or more preceding context embeddings. The authors report improvements across chunk sizes, left-context configurations, and numbers of context embeddings, with an average 25.0% relative WER reduction over a non-contextual baseline and negligible added latency. On LibriSpeech and diverse test sets, DCTX-Conformer narrows the streaming gap and approaches or surpasses SOTA in several settings.

Abstract

Conformer-based end-to-end models have become ubiquitous and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with full past context and streaming with limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA with a relative word error rate reduction of 25.0%, with a negligible latency impact due to the additional context embeddings.
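
For reference, a relative WER reduction such as the 25.0% quoted above is conventionally computed as shown below. The symbols $\mathrm{WER}_{\text{base}}$ (baseline system) and $\mathrm{WER}_{\text{new}}$ (proposed system) are notation assumed here for illustration; they are not taken from the paper.

$$
\mathrm{WERR} = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{new}}}{\mathrm{WER}_{\text{base}}} \times 100\%
$$

Under this definition, a relative reduction of 25.0% means the proposed system's WER is 0.75 times the baseline WER.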

Paper Structure (14 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Contextual carry-over mechanism using non-overlapping chunks.
  • Figure 2: Contextual carry-over mask. Illustration for 4 non-overlapping chunks of size 4 with left context size set to 1 chunk. Non-white squares = 1, white squares = 0. Light gray = frame to frame dependency, dark gray = context embedding involved. Orange squares represent the context carried over from past context embeddings in the encoder layers $n > 1$.
  • Figure 3: WER as a function of left context (ms) for a chunk size of 640ms without look-ahead frames, for models with and without context carry-over.
  • Figure 4: WER as a function of the number of context embeddings for a chunk size of 640ms without look-ahead frames and with left context of 0ms and 1280ms. For 0 context embeddings, we consider the baseline model without context carry-over.
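
The frame-to-frame part of the mask described in the Figure 2 caption (the light gray squares) can be sketched as below for 4 non-overlapping chunks of size 4 with a left context of 1 chunk. This is a minimal illustration, not the authors' implementation. The function name `chunk_attention_mask` and its parameters are hypothetical, and full attention among frames of the same chunk is an assumption. The rows and columns for context embeddings (dark gray and orange squares) are left out because their exact layout is not specified in the text here.

```python
def chunk_attention_mask(num_chunks: int, chunk_size: int, left_chunks: int) -> list[list[int]]:
    """Build a frame-to-frame attention mask for non-overlapping chunks.

    mask[q][k] == 1 means query frame q may attend to key frame k.
    Hypothetical helper: context-embedding rows/columns are NOT modeled.
    """
    total = num_chunks * chunk_size
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        q_chunk = q // chunk_size
        # Earliest visible chunk, limited by the left-context budget.
        first_chunk = max(0, q_chunk - left_chunks)
        start = first_chunk * chunk_size
        # Assumed: a frame sees its whole chunk but no look-ahead beyond it.
        end = (q_chunk + 1) * chunk_size
        for k in range(start, end):
            mask[q][k] = 1
    return mask


# Configuration from the Figure 2 caption: 4 chunks of size 4, left context of 1 chunk.
mask = chunk_attention_mask(num_chunks=4, chunk_size=4, left_chunks=1)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

Each printed row is one query frame: `#` marks a visible key frame and `.` a masked one. Frames in the first chunk see only their own chunk, while frames in later chunks see their own chunk plus the one immediately before it.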