Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking

Khanh Le, Duc Chau

TL;DR

This work tackles the latency-accuracy trade-off in streaming ASR by introducing Time-Shifted Contextual Attention (TSCA) to create and refine in-context future information during decoding, and Dynamic Right Context (DRC) masking to expose the model to variable future contexts during training. A Dynamic Chunk Convolution with Lookahead further enables flexible use of right context with decoding chunks. Together with a low-latency streaming pipeline, TSCA and DRC achieve substantial improvements on LibriSpeech (up to 13.9% relative WER reduction) while preserving batch processing and keeping user-perceived latency low (RTF < 1). The approach is demonstrated on a Conformer-based encoder with a CTC-AED framework, showing robust gains across configurations and confirming the practical value of leveraging future context in streaming ASR for real-world applications.

Abstract

Chunk-based inference is a popular approach to real-time streaming speech recognition, valued for its simplicity and efficiency. However, because it restricts the model's focus to the history and the current chunk, performance may degrade in scenarios that require future context. To address this, we propose a novel approach featuring Time-Shifted Contextual Attention (TSCA) and Dynamic Right Context (DRC) masking. Our method achieves a relative word error rate reduction of 10% to 13.9% on the LibriSpeech dataset by including the in-context future information provided by TSCA. Moreover, we present a streaming automatic speech recognition pipeline that integrates TSCA with minimal user-perceived latency while also supporting batch processing, making it practical for various applications.
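The two headline metrics, relative WER reduction and real-time factor (RTF), follow from their standard definitions. The sketch below is only illustrative: the function names and every input value are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
print(relative_wer_reduction(5.0, 4.5))  # 10.0 (percent)
print(real_time_factor(2.0, 10.0))       # 0.2, i.e. below the RTF < 1 threshold
```

A relative reduction is measured against the baseline WER, so the same absolute gain counts for more when the baseline is already low.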

Paper Structure

This paper contains 11 sections, 5 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) shows the conventional chunk mask and (b) our proposed dynamic right context mask. The areas in yellow, green, and blue are the left context $l$, chunk $c$, and right context $r$, respectively. Here, $l = 3$, $c = 3$, and $r = 2$, in frame units.
  • Figure 2: Streaming ASR with future context pipeline design.
  • Figure 3: WER comparison between different $p$ values with chunk size $c = 10$ on dev-clean.
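
The masks described in the Figure 1 caption can be approximated with a small sketch. It assumes each query frame may attend to $l$ frames before its chunk, to every frame of its own chunk of size $c$, and to $r$ frames after it. The helper names, and in particular the uniform sampling of $r$ per training step, are assumptions for illustration only; this section does not specify how the paper draws the right context, and the parameter $p$ from Figure 3 is not modeled.

```python
import random


def chunk_mask(num_frames, chunk, left, right=0):
    """Return mask[i][j] == True when query frame i may attend to key frame j."""
    mask = []
    for i in range(num_frames):
        start = (i // chunk) * chunk          # first frame of the chunk containing i
        end = min(start + chunk, num_frames)  # one past the last frame of that chunk
        lo = max(0, start - left)             # earliest visible frame (left context)
        hi = min(num_frames, end + right)     # one past the latest visible frame
        mask.append([lo <= j < hi for j in range(num_frames)])
    return mask


def sample_right_context(max_right, rng=random):
    """ASSUMPTION: draw r uniformly from 0..max_right for each training step."""
    return rng.randint(0, max_right)


def show(mask):
    """Render the mask as text: 'x' = visible, '.' = masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


# Values from the Figure 1 caption: l = 3, c = 3, r = 2 (frame units).
conventional = chunk_mask(9, chunk=3, left=3, right=0)
with_right = chunk_mask(9, chunk=3, left=3, right=2)
print(show(conventional))
print()
print(show(with_right))
```

With `right=0` the function reduces to the conventional chunk mask; a nonzero `right` widens each row toward future frames, clipped at the end of the sequence.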