Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition
Robert Flynn, Anton Ragni
TL;DR
This study systematically investigates very long-context automatic speech recognition using encoder-only Conformer models trained on sequence lengths from $10\ \text{s}$ to $1\ \text{hour}$. It shows that performance improves with up to $21.8\ \text{minutes}$ of context, giving up to a $14.2\%$ relative WER reduction over short-context baselines, particularly under domain shift. To enable long-context training and assess it fairly, the authors introduce several training and evaluation adaptations: Flash Attention, 8× subsampling, sequence-length warmup, and three evaluation schemes. Synthetic and cross-dataset analyses show that the model uses both linguistic and acoustic aspects of distant context. Key findings include the importance of rotary positional encodings and of sufficient model size, as well as the need for robust context handling when the context changes suddenly. The work provides practical insights for deploying long-context ASR and points to future work on alternative architectures and data distributions to further leverage extended context.
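For reference, a relative WER reduction such as the $14.2\%$ figure above is conventionally computed against the baseline error rate. The subscripts below are labels chosen here for illustration, not notation taken from the paper:

$$\text{relative WER reduction} = \frac{\text{WER}_{\text{short}} - \text{WER}_{\text{long}}}{\text{WER}_{\text{short}}} \times 100\%$$

Here $\text{WER}_{\text{short}}$ is the word error rate of the short-context baseline and $\text{WER}_{\text{long}}$ is that of the long-context model on the same evaluation data.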
Abstract
Automatic speech recognition (ASR) models are normally trained to operate over single utterances with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but it also reflects a common, though often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. To use such systems on long-form audio recordings, the recordings must first be segmented into short utterances that are then processed independently. In this work, we show that, due to recent algorithmic and hardware advances, this is no longer necessary, and that current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement over a short-context baseline in our primary experiments. By modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model's use of context. From these results, it is clear that the model uses both linguistic and acoustic aspects of the distant context.
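To give a rough sense of the sequence lengths involved, the sketch below estimates how many encoder frames an attention layer would see for a given audio duration, and how large a fully materialised attention score matrix would be. The 8× subsampling factor comes from the TL;DR; the 10 ms frame hop (100 feature frames per second), the 2-byte element size, and all function names are assumptions made here for illustration and are not stated in the text.

```python
# Back-of-the-envelope sequence-length arithmetic for long-context ASR.
# ASSUMPTION: features are extracted at 100 frames per second (10 ms hop).
# The 8x subsampling factor is taken from the TL;DR.
FRAMES_PER_SECOND = 100  # assumed, not stated in the text
SUBSAMPLING = 8


def encoder_frames(duration_s: float,
                   frames_per_second: int = FRAMES_PER_SECOND,
                   subsampling: int = SUBSAMPLING) -> int:
    """Number of frames reaching the attention layers after subsampling."""
    return round(duration_s * frames_per_second) // subsampling


def attention_score_bytes(n_frames: int, bytes_per_element: int = 2) -> int:
    """Size of one full n x n attention score matrix (one head, one layer).

    ASSUMPTION: 2 bytes per element (16-bit floats).
    """
    return n_frames * n_frames * bytes_per_element


if __name__ == "__main__":
    for label, seconds in [("10 s", 10), ("30 s", 30), ("1 hour", 3600)]:
        n = encoder_frames(seconds)
        gb = attention_score_bytes(n) / 1e9
        print(f"{label:>7}: {n:>6} encoder frames, "
              f"{gb:.6f} GB per full score matrix")
```

Under these assumptions, the quadratic growth of the score matrix is what makes naive attention impractical at the one-hour scale. Flash Attention computes attention in tiles without storing the full matrix, which keeps memory use roughly linear in sequence length.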
