Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

Mohammad Zeineldeen; Albert Zeyer; Ralf Schlüter; Hermann Ney

TL;DR

This work introduces a chunked attention-based encoder-decoder (AED) model for streaming speech recognition. The model processes fixed-size chunks and emits an end-of-chunk (EOC) symbol to move to the next chunk, which makes it equivalent to a transducer that operates on chunks instead of frames, so it can be trained and decoded with alignment-synchronous strategies. In experiments on LibriSpeech and TED-LIUM-v2, both the chunked decoder and the chunked encoder-decoder reach word error rates (WERs) competitive with a global AED, generalize well to long-form speech, and show favorable latency. The results also connect chunked attention to transducer models, in particular with respect to external language models and internal language model (ILM) correction.

Abstract

We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances the model from one chunk to the next and effectively replaces the conventional end-of-sequence symbol. This modification, while minor, makes our model equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on LibriSpeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
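To make the role of the EOC symbol concrete, the following minimal Python sketch enumerates the label/EOC sequences for a short label sequence over a given number of chunks and collapses such a sequence back into labels. It is an illustration only, not the authors' implementation: the names `EOC`, `enumerate_alignments`, and `collapse` are hypothetical, and the sketch assumes that every chunk is closed by exactly one EOC, so that each sequence ends with EOC.

```python
from itertools import combinations

EOC = "<eoc>"  # hypothetical token name for the end-of-chunk symbol


def enumerate_alignments(labels, num_chunks):
    """Yield every sequence of length num_chunks + len(labels) that contains
    the labels in order and exactly num_chunks EOC symbols.

    Assumption: the final symbol is always EOC, because closing the last
    chunk takes over the role of the end-of-sequence symbol.
    """
    total = num_chunks + len(labels)
    # Choose positions for the labels among all slots except the last one.
    for label_positions in combinations(range(total - 1), len(labels)):
        seq = [EOC] * total
        for pos, label in zip(label_positions, labels):
            seq[pos] = label
        yield seq


def collapse(alignment):
    """Drop EOC symbols and return (label, chunk_index) pairs, where
    chunk_index (0-based) is the chunk in which the label was emitted."""
    chunk = 0
    emitted = []
    for symbol in alignment:
        if symbol == EOC:
            chunk += 1  # EOC plays the role of the transducer blank
        else:
            emitted.append((symbol, chunk))
    return emitted
```

Under the stated assumption, `list(enumerate_alignments("ABC", 4))` contains 20 sequences of length 7, and `collapse` maps each of them back to the labels A, B, C together with the chunk in which each label was produced.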

Paper Structure (16 sections, 8 equations, 3 figures, 7 tables)

This paper contains 16 sections, 8 equations, 3 figures, 7 tables.

Introduction & Related Work
Global AED Model
Chunked AED Model
Streamable Chunked Encoder
Streamable Chunked Decoder
Training
Beam Search
Experiments
Chunked Decoder
Chunked Encoder-Decoder
Latency
Long-Form Recognition
Beam Size and Length Normalization
External Language Model
Comparison to Transducer
...and 1 more section

Figures (3)

Figure 1: Chunking on input frames $x_{1:T}$ with chunk center size $T_w$, right context $T_r$ and stride $T_s$, where we have $T_s = T_w$.
Figure 2: Chunked self-attention in the encoder.
Figure 3: Possible transition sequences $a_{1:K+N}$ for non-EOC label sequence $ABC$ with length $N=3$ and $K=4$ chunks, where $\varepsilon$ is the end-of-chunk (EOC) symbol.
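The chunking described in the caption of Figure 1 can be sketched in Python as follows. This is a hedged illustration rather than the paper's code: the function name `chunk_boundaries` and its parameters are hypothetical, the stride defaults to the center size to match $T_s = T_w$, and clipping the last chunk at the end of the input is an assumption, as is the absence of any left context.

```python
import math


def chunk_boundaries(num_frames, center_size, right_context, stride=None):
    """Return one (start, center_end, end) triple of frame indices per chunk.

    Frames [start, center_end) form the chunk center of size T_w, and
    frames [center_end, end) form the right context of size T_r.
    """
    if stride is None:
        stride = center_size  # T_s = T_w, as in the Figure 1 caption
    num_chunks = math.ceil(num_frames / stride)
    chunks = []
    for k in range(num_chunks):
        start = k * stride
        # Assumption: chunks running past the input are clipped to num_frames.
        center_end = min(start + center_size, num_frames)
        end = min(center_end + right_context, num_frames)
        chunks.append((start, center_end, end))
    return chunks
```

For example, `chunk_boundaries(10, 4, 2)` gives three chunks whose centers cover frames 0-3, 4-7, and 8-9; the first two chunks additionally see two frames of right context, while the last one has none left to see.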
