Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

Awni Hannun, Ann Lee, Qiantong Xu, Ronan Collobert

TL;DR

The paper tackles the efficiency gap in end-to-end speech recognition by introducing a fully convolutional sequence-to-sequence model with time-depth separable convolutions (TDS) in the encoder. It pairs the encoder with a simple, fast decoder and a stable beam-search pipeline that can leverage external language models, achieving state-of-the-art end-to-end WER on LibriSpeech while using far fewer parameters than competitive RNN baselines. Key contributions include the TDS block design, training-time strategies (soft window pre-training, random sampling, word-piece sampling, dropout, label smoothing), and beam-search stabilizers, all enabling large-beam LM integration without performance degradation. The results demonstrate major efficiency gains (training/decoding) and substantial WER improvements, highlighting the practicality of convolutional architectures for large-scale ASR.

Abstract

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
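To make the parameter-saving claim concrete, here is a rough counting sketch. It rests on assumptions that the abstract does not state: the input to a block has `w` feature bands with `c` channels each, the time convolution is a `k x 1` 2D convolution shared across the `w` bands, and the unfactored alternative is a 1D convolution of the same width `k` over all `w * c` features. The function names and example sizes are hypothetical, and biases, the fully connected sub-block, and normalization are left out.

```python
def separable_time_conv_params(k: int, c: int) -> int:
    """Weights in a k x 1 2D convolution with c input and c output channels.

    The kernel is shared across the w feature bands, so w does not appear.
    """
    return k * c * c


def full_1d_conv_params(k: int, w: int, c: int) -> int:
    """Weights in a width-k 1D convolution over w * c input and output features."""
    return k * (w * c) ** 2


def reduction_factor(k: int, w: int, c: int) -> float:
    """How many times fewer weights the shared k x 1 convolution needs."""
    return full_1d_conv_params(k, w, c) / separable_time_conv_params(k, c)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    k, w, c = 21, 80, 10
    print("shared k x 1 conv:", separable_time_conv_params(k, c))
    print("full 1D conv:     ", full_1d_conv_params(k, w, c))
    print("reduction factor: ", reduction_factor(k, w, c))
```

Under these assumptions both layers span the same `k` time steps, so the receptive field in time is unchanged while the weight count shrinks by a factor of `w ** 2`.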

Paper Structure

This paper contains 22 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The TDS convolution model architecture. (a) The sub-blocks of the TDS convolution layer are (b) a 2D convolution over time followed by (c) a fully connected block.
  • Figure 2: The WER as a function of the receptive field. We vary the kernel size, $k \in \{5, 9, 13, 17, 21\}$; every model otherwise has $\sim$36.5 million parameters. We report the mean WER over three runs using a beam size of 1 and no LM.
  • Figure 3: The WER as a function of beam size for both the 4-gram and the convLM.
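
Figure 2 ties the kernel size to the receptive field. As a minimal sketch of that relationship, the snippet below uses the standard formula for a stack of stride-1, undilated convolutions, `1 + n * (k - 1)`. The layer count `NUM_LAYERS` is a hypothetical value, not taken from the paper, and the sketch ignores any striding or sub-sampling that a real encoder might use, so the printed numbers are illustrative only.

```python
def receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field, in input frames, of stacked stride-1 undilated convolutions.

    Each layer widens the field by (kernel_size - 1) frames.
    """
    return 1 + num_layers * (kernel_size - 1)


KERNEL_SIZES = [5, 9, 13, 17, 21]  # the values of k listed in the Figure 2 caption
NUM_LAYERS = 10                    # hypothetical depth, assumed for illustration

if __name__ == "__main__":
    for k in KERNEL_SIZES:
        print(f"k={k:2d} -> receptive field of {receptive_field(k, NUM_LAYERS)} frames")
```

With the depth held fixed, the receptive field grows linearly in `k`, which is why varying the kernel size alone is enough to sweep the horizontal axis of such a plot.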