How Much Context Does My Attention-Based ASR System Need?

Robert Flynn, Anton Ragni

TL;DR

This paper empirically examines how much acoustic context dense-attention ASR models can effectively use, exploring context lengths from 5 seconds to 1 hour on a large Spotify podcast corpus. To make long-context training and evaluation feasible, it combines a FastConformer architecture with Flash Attention, moving window decoding, sequence length warmup, and several positional encoding schemes. Training with up to approximately 21.8 minutes of context yields meaningful WER improvements (up to 14.5% relative on Earnings-22), and 1-hour contexts are trainable without degradation under the proposed setup. Longer contexts also improve robustness to domain shift, and the size of the gains depends strongly on positional encoding and model size. The results offer guidance on when longer context pays off, highlight rotary positional encoding as favorable for long sequences, and suggest that deeper, wider models are needed to exploit extended context, while the number of attention heads interacts with sequence length in less straightforward ways. The work also provides checkpoints and code to support further research in long-context ASR and related interpretability analyses.

Abstract

For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations are presented on the long-format datasets Earnings-22, Tedlium and Rev16. Results demonstrate a benefit from training with up to 21.8 minutes of acoustic context, showing up to a 14.5% relative improvement over a baseline trained with 10 seconds of context. We find that the model's width/depth, positional encoding scheme and number of attention heads impact its ability to use longer contexts.
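
The "relative improvement" quoted above is a relative reduction in word error rate (WER) with respect to the short-context baseline. A minimal sketch of that arithmetic follows. The function name `relative_wer_reduction` and the two WER values are hypothetical, chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer versus baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical WERs (in %), not taken from the paper.
    baseline = 20.0  # assumed short-context baseline
    long_ctx = 17.1  # assumed long-context model
    reduction = relative_wer_reduction(baseline, long_ctx)
    print(f"{reduction:.1f}% relative WER reduction")
```

A relative figure of this kind depends on the baseline WER, so the same relative gain corresponds to different absolute gains on different test sets.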

Paper Structure (16 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: WER reduction from a sequence length of 10s
  • Figure 2: WER reduction from 10s baseline on Rev16 with varying amounts of background music
  • Figure 3: WER at various sequence lengths for different positional encoding methods on Earnings-22
  • Figure 4: WER reduction from a sequence length of 10s for various model sizes on Earnings-22
  • Figure 5: WER on Earnings-22 when varying the number of attention heads (H)