Exploring the limits of decoder-only models trained on public speech recognition corpora
Ankit Gupta, George Saon, Brian Kingsbury
TL;DR
The paper asks whether decoder-only Transformer ASR models trained solely on public data can match encoder-decoder and proprietary systems. It introduces DOTA, a decoder-only ASR model trained on 93K hours of public English data, and systematically analyzes data composition, audio processing, model design (causal vs. prefix attention), and training. DOTA achieves competitive word error rates with substantially fewer parameters than Whisper: it outperforms Whisper large-v3 on 7 of 15 test sets, particularly when using bidirectional attention over the audio frames, and surpasses OWSM on nearly all English benchmarks. The work demonstrates the viability of open, decoder-only ASR pipelines, provides actionable guidance for building high-performance public-data ASR systems, and releases code and checkpoints to the community.
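The causal vs. prefix attention distinction mentioned above can be illustrated with a small mask-building sketch. This is not DOTA's implementation: the function name `attention_mask`, the sequence layout (audio frames first, then text tokens), and the toy sizes are assumptions made purely for illustration.

```python
def attention_mask(num_audio: int, num_text: int, prefix: bool) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): num_audio audio frames followed by
    num_text text tokens in a single decoder-only sequence.
    """
    total = num_audio + num_text
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            # Causal rule: a position sees itself and everything before it.
            causal_ok = j <= i
            # Prefix rule: audio frames additionally see all other audio frames.
            prefix_ok = prefix and i < num_audio and j < num_audio
            row.append(causal_ok or prefix_ok)
        mask.append(row)
    return mask


# Toy example: 3 audio frames followed by 2 text tokens.
causal = attention_mask(3, 2, prefix=False)
prefix = attention_mask(3, 2, prefix=True)
```

In both variants the text tokens are decoded causally and can see every audio frame; the only difference is whether an audio frame can also attend to later audio frames.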
Abstract
The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
