Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition

Robert Flynn; Anton Ragni

TL;DR

This study systematically investigates very long-context automatic speech recognition using encoder-only Conformer models trained on sequence lengths from $10\ \text{s}$ to $1\ \text{hour}$. It demonstrates that meaningful gains emerge with up to $21.8\ \text{minutes}$ of context, with up to $14.2\%$ relative WER reduction over short-context baselines, particularly under domain shifts. The authors introduce training and evaluation adaptations—Flash Attention, 8× subsampling, sequence-length warmup, and three evaluation schemes—to enable and fairly assess long-context usage, and they reveal that both linguistic and acoustic aspects of distant context contribute to performance. Key findings include the importance of rotary positional encodings, sufficient model size, and robust context handling in the presence of sudden context changes, supported by synthetic and cross-dataset analyses. The work provides practical insights for deploying long-context ASR and points to future work on alternative architectures and data distributions to further leverage extended context.

Abstract

Automatic speech recognition (ASR) models are normally trained to operate over single utterances of less than 30 seconds. This choice is partly due to computational constraints, but it also reflects a common, though often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. To use such systems on long-form audio recordings, the recordings must first be segmented into short utterances, which are then processed independently. In this work, we show that, owing to recent algorithmic and hardware advances, this is no longer necessary: current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To better understand the relationship between training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths, from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement over a short-context baseline in our primary experiments. By modifying various architectural components, we find that the method of encoding positional information and the model's width and depth are important factors when working with long sequences. Finally, we construct a series of evaluations using synthetic data to help analyse the model's use of context. From these results, it is clear that the model uses both linguistic and acoustic aspects of the distant context.
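
The "relative improvement" quoted above is a ratio against the baseline word error rate (WER), not an absolute difference in percentage points. The following is a minimal sketch of that arithmetic; the function name and the WER values are hypothetical and chosen only for illustration, and they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, long_context_wer: float) -> float:
    """Return the relative WER reduction, in percent, over a baseline.

    Both arguments are WERs expressed in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - long_context_wer) / baseline_wer


# Hypothetical WERs (not taken from the paper): a drop from 20.0% to 17.0%
# is 3 points absolute, but 15% relative to the baseline.
print(relative_wer_reduction(20.0, 17.0))
```

Under this definition, a 14.2% relative improvement means the long-context WER is roughly 0.858 times the short-context baseline WER.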

Paper Structure (32 sections, 2 equations, 16 figures, 4 tables)

Introduction
Prior Work
Long-Context Acoustic Models
Long-Context Language Modelling
Modifications For Training with Long Sequences
Flash Attention
Architecture
Sequence Length Warmup
Positional Encoding
No Positional Encodings (NoPos)
Sinusoidal Positional Encodings
Rotary Positional Encodings (Su et al., 2021)
Modifications For Evaluating with Long Sequences
Moving Averaged Window Decoding
Buffered Window Decoding
...and 17 more sections

Figures (16)

Figure 1: Training throughput (hours of audio processed per second on an H100 GPU) at different sequence lengths.
Figure 2: Moving averaged window decoding.
Figure 3: Buffered window decoding.
Figure 4: Sliding Window Attention.
Figure 5: Depiction of context fragmentation when a recording is segmented. (Top) Sequence length of 10s results in 1 area of fragmentation. (Bottom) Sequence length of 5s results in 3 areas of fragmentation.
...and 11 more figures
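
The fragmentation counts in the Figure 5 caption follow from simple arithmetic: cutting a recording into fixed-length segments creates one break in context at every internal segment boundary. The sketch below assumes a 20-second recording, which is consistent with the caption but not stated in it, and the function name is hypothetical.

```python
import math


def fragmentation_points(recording_s: float, segment_s: float) -> int:
    """Count internal boundaries when a recording is cut into fixed-length segments.

    Each boundary is a place where a segment loses access to its
    neighbouring context.
    """
    if recording_s <= 0 or segment_s <= 0:
        raise ValueError("durations must be positive")
    n_segments = math.ceil(recording_s / segment_s)
    return n_segments - 1


# Assumed 20 s recording (an illustration, not a value from the paper).
print(fragmentation_points(20, 10))  # 2 segments -> 1 boundary
print(fragmentation_points(20, 5))   # 4 segments -> 3 boundaries
```

Halving the segment length roughly doubles the number of boundaries, and a segment as long as the recording has none.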
