DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR

Goeric Huybrechts; Srikanth Ronanki; Xilai Li; Hadis Nosrati; Sravan Bodapati; Katrin Kirchhoff

TL;DR

The paper identifies a persistent performance gap in unified ASR between streaming with full past context and streaming with limited past context. It introduces DCTX-Conformer, a unified Conformer trained with dynamic chunking that adds a dynamic contextual carry-over mechanism: through a dynamic attention mask, each chunk attends to both its left context and one or more preceding context embeddings. The authors report improvements across chunk sizes, left-context configurations, and numbers of context embeddings, with an average 25.0% relative WER reduction over a non-contextual baseline and a negligible latency impact. On LibriSpeech and diverse test sets, DCTX-Conformer narrows the streaming gap and approaches or surpasses SOTA in several settings, illustrating practical benefits for low-latency ASR.

Abstract

Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training have helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full past context and streaming with a limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative 25.0% in word error rate, with a negligible latency impact due to the additional context embeddings.

Paper Structure (14 sections, 6 equations, 4 figures, 3 tables)

Introduction
Approach and related work
End-to-end unified ASR
Dynamic contextual carry-over mechanism
Experimental settings
Datasets
Setup
Results
Performance impact of chunk size
Performance impact of left context size
Performance impact of context embeddings
LibriSpeech comparison with SOTA
Latency study
Conclusions

Figures (4)

Figure 1: Contextual carry-over mechanism using non-overlapping chunks.
Figure 2: Contextual carry-over mask. Illustration for 4 non-overlapping chunks of size 4 with left context size set to 1 chunk. Non-white squares = 1, white squares = 0. Light gray = frame to frame dependency, dark gray = context embedding involved. Orange squares represent the context carried over from past context embeddings in the encoder layers $n > 1$.
Figure 3: WER as a function of left context (ms) for a chunk size of 640ms without look-ahead frames, for the model without and with context carry-over.
Figure 4: WER as a function of the number of context embeddings for a chunk size of 640ms without look-ahead frames and with a left context of 0ms and 1280ms. For 0 context embeddings, we consider the baseline model without context carry-over.
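
The frame-to-frame part of the mask described in the Figure 2 caption (4 non-overlapping chunks of size 4, left context of 1 chunk) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `chunk_attention_mask`, its parameters, and the convention that rows are query frames and columns are key frames are assumptions made here. The context-embedding entries (dark gray and orange squares in Figure 2) are deliberately left out, since their exact layout is not specified in this summary.

```python
def chunk_attention_mask(num_frames, chunk_size, left_context_chunks):
    """Build a num_frames x num_frames 0/1 mask (hypothetical helper).

    Assumed convention: mask[q][k] == 1 means query frame q may attend
    to key frame k. A frame sees every frame in its own chunk plus the
    frames of up to `left_context_chunks` preceding chunks. Context
    embeddings are NOT modeled in this sketch.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # Oldest chunk still visible to this query frame.
        first_chunk = max(0, q_chunk - left_context_chunks)
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            row.append(1 if first_chunk <= k_chunk <= q_chunk else 0)
        mask.append(row)
    return mask


# Configuration from the Figure 2 caption: 4 chunks of 4 frames,
# left context of 1 chunk.
mask = chunk_attention_mask(num_frames=16, chunk_size=4, left_context_chunks=1)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

Setting `left_context_chunks=0` in this sketch restricts each frame to its own chunk, which corresponds to the 0ms left-context setting mentioned in the Figure 4 caption.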
