Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking
Khanh Le, Duc Chau
TL;DR
This work tackles the latency-accuracy trade-off in streaming ASR with two techniques: Time-Shifted Contextual Attention (TSCA), which creates and refines in-context future information during decoding, and Dynamic Right Context (DRC) masking, which exposes the model to variable amounts of future context during training. A Dynamic Chunk Convolution with Lookahead further enables flexible use of right context alongside decoding chunks. Combined with a low-latency streaming pipeline, TSCA and DRC reduce WER on LibriSpeech by up to 13.9% relative, while preserving batch processing and keeping user-perceived latency low (RTF < 1). The approach is demonstrated on a Conformer-based encoder in a CTC-AED framework, showing consistent gains across configurations and supporting the practical value of future context in streaming ASR for real-world applications.
Abstract
Chunk-based inference is a popular approach to real-time streaming speech recognition, valued for its simplicity and efficiency. However, because it restricts the model's attention to the history and the current chunk, performance can degrade in cases that require future context. To address this, we propose a novel approach featuring Time-Shifted Contextual Attention (TSCA) and Dynamic Right Context (DRC) masking. With the in-context future information provided by TSCA, our method achieves a relative word error rate reduction of 10% to 13.9% on the LibriSpeech dataset. Moreover, we present a streaming automatic speech recognition pipeline that integrates TSCA with minimal user-perceived latency while also supporting batch processing, making it practical for various applications.
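
To make the idea of chunk-based attention with a variable right context concrete, the following is a minimal sketch, not the paper's implementation. It assumes full left (history) context, a fixed chunk size, and a right-context length drawn uniformly per training step; the function names `chunk_mask` and `sample_drc_mask` and the uniform sampling rule are illustrative assumptions, since the abstract does not specify how DRC selects its context lengths.

```python
import numpy as np

def chunk_mask(num_frames, chunk_size, right_context):
    """Return a boolean [T, T] mask; mask[i, j] is True when frame i may attend to frame j.

    Assumption: every frame sees the full history, its own chunk, and
    `right_context` frames past the end of its chunk.
    """
    idx = np.arange(num_frames)
    # Index of the last frame in the chunk that contains each query frame.
    chunk_end = (idx // chunk_size + 1) * chunk_size - 1
    # Furthest key frame each query may see, clipped to the sequence length.
    limit = np.minimum(chunk_end + right_context, num_frames - 1)
    return idx[None, :] <= limit[:, None]

def sample_drc_mask(num_frames, chunk_size, max_right_context, rng):
    """Hypothetical training-time helper: sample a right-context length, then build the mask.

    Assumption: the length is uniform over [0, max_right_context].
    """
    right_context = int(rng.integers(0, max_right_context + 1))
    return chunk_mask(num_frames, chunk_size, right_context), right_context

# Usage: an 8-frame utterance, 4-frame chunks, up to 2 frames of lookahead.
rng = np.random.default_rng(0)
mask, rc = sample_drc_mask(num_frames=8, chunk_size=4, max_right_context=2, rng=rng)
```

With `right_context=0` the mask reduces to plain chunk-based attention, which is the baseline setting the abstract describes as limited to history and the current chunk.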
