Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition
Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney
TL;DR
This work introduces a chunked attention-based encoder-decoder (AED) model for streaming speech recognition. By processing fixed-size chunks and emitting an end-of-chunk (EOC) symbol to move to the next chunk, the model becomes equivalent to a transducer that operates on chunks, and it can be trained and decoded with alignment-synchronous strategies. Experiments on LibriSpeech and TED-LIUM-v2 show that a chunked decoder, or a chunked encoder and decoder together, achieves word error rates (WERs) competitive with a global AED, with strong generalization to long-form speech and favorable latency properties. The results highlight the practicality of chunked attention for streaming ASR and establish useful links to transducer models, especially when incorporating external language models and internal language model (ILM) corrections.
Abstract
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances the decoder from one chunk to the next, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, makes our model equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on LibriSpeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
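The chunk-advance mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `EOC`, `split_into_chunks`, `greedy_chunked_decode`, `toy_step`, and the `max_labels_per_chunk` cap are all hypothetical, the search is greedy rather than beam search, and the toy scoring function stands in for an attention decoder that would attend over the encoder states of the current chunk.

```python
from typing import Callable, List, Optional, Sequence, Tuple

EOC = "<eoc>"  # hypothetical name for the end-of-chunk symbol


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[Sequence]:
    """Split a frame sequence into consecutive fixed-size chunks.

    The last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def greedy_chunked_decode(
    frames: Sequence,
    chunk_size: int,
    step_fn: Callable[[Sequence, List[str]], str],
    max_labels_per_chunk: int = 20,  # assumed safety cap, not from the paper
) -> Tuple[List[str], List[str]]:
    """Greedy decoding over chunks.

    Within a chunk, labels are emitted until step_fn returns EOC, which moves
    decoding to the next chunk. Decoding stops after the last chunk, so no
    end-of-sequence symbol is needed.
    Returns (labels, alignment), where alignment also contains the EOC symbols.
    """
    labels: List[str] = []
    alignment: List[str] = []
    for chunk in split_into_chunks(frames, chunk_size):
        emitted_in_chunk: List[str] = []
        for _ in range(max_labels_per_chunk):
            token = step_fn(chunk, emitted_in_chunk)
            alignment.append(token)
            if token == EOC:
                break
            emitted_in_chunk.append(token)
            labels.append(token)
        else:
            # Cap reached without EOC: force the advance to the next chunk.
            alignment.append(EOC)
    return labels, alignment


def toy_step(chunk: Sequence[Optional[str]], emitted_in_chunk: List[str]) -> str:
    """Toy stand-in for the decoder: emit the non-None frames of the chunk in
    order, then EOC."""
    pending = [f for f in chunk if f is not None]
    if len(emitted_in_chunk) < len(pending):
        return pending[len(emitted_in_chunk)]
    return EOC


if __name__ == "__main__":
    toy_frames = ["a", None, None, "b", "c", None, None]
    labels, alignment = greedy_chunked_decode(toy_frames, 3, toy_step)
    print(labels)
    print(alignment)
```

In this sketch the alignment contains exactly one EOC per chunk, which mirrors the correspondence stated in the abstract: EOC plays the role of the transducer blank, except that it advances by one chunk rather than by one frame.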
