Improved training of end-to-end attention models for speech recognition

Albert Zeyer, Kazuki Irie, Ralf Schlüter, Hermann Ney

TL;DR

Extends end-to-end speech recognition with attention-based seq2seq models trained on BPE subword units. The core contributions are a layer-wise encoder pretraining scheme that starts with a high time reduction factor and lowers it during training, optionally combined with an auxiliary CTC loss, and the integration of external LSTM language models through shallow fusion. The approach achieves competitive results on Switchboard 300h and LibriSpeech 1000h, including state-of-the-art LibriSpeech dev-clean/test-clean WERs (3.54% and 3.82%) with data-only training. A beam-search analysis indicates that the remaining errors are limited by the model rather than by the search. Overall, the work demonstrates practical viability and provides training and decoding strategies for scalable, open-vocabulary ASR.

Abstract

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.
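
To make the shallow fusion step concrete, here is a minimal sketch of the usual log-linear combination, in which the attention model's log-probability for each subword is added to a weighted log-probability from the external language model. The function names, the toy vocabulary, the probabilities, and the weight of 0.3 are illustrative assumptions and are not taken from the paper.

```python
import math


def shallow_fusion_step(att_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-subword scores for one decoding step.

    score(y) = log p_att(y | x) + lm_weight * log p_lm(y)

    Both inputs map subword -> log-probability. lm_weight is a
    hypothetical value chosen for illustration only.
    """
    if att_log_probs.keys() != lm_log_probs.keys():
        raise ValueError("both models must score the same subword vocabulary")
    return {
        tok: att_log_probs[tok] + lm_weight * lm_log_probs[tok]
        for tok in att_log_probs
    }


def best_token(scores):
    """Return the subword with the highest combined score."""
    return max(scores, key=scores.get)


# Toy distributions (assumed): the attention model slightly prefers
# "thee", while the language model strongly prefers "the".
att = {"the": math.log(0.45), "thee": math.log(0.50), "tea": math.log(0.05)}
lm = {"the": math.log(0.90), "thee": math.log(0.01), "tea": math.log(0.09)}

without_lm = best_token(shallow_fusion_step(att, lm, lm_weight=0.0))
with_lm = best_token(shallow_fusion_step(att, lm, lm_weight=0.3))
print(without_lm, with_lm)
```

In a full decoder this combination would be applied to every hypothesis at every step of beam search, and the weight would typically be tuned on held-out data.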

Paper Structure

This paper contains 12 sections, 8 equations, 4 tables.