Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, Chunyang Wu
TL;DR
Speech ReaLLM introduces a real-time, streaming ASR paradigm that combines decoder-only LLMs with an RNN-T-like BLANK mechanism, enabling continuous audio processing without explicit end-pointing. The model is trained on time-aligned targets obtained from an external alignment teacher, and it interleaves speech and word tokens so that it can produce a hypothesis immediately after each input, achieving competitive WERs on Librispeech without external LMs. In the experiments, small 80M models trained from scratch show strong streaming performance and robustness to long utterances, while a pre-trained 7B Llama-2 can be fine-tuned to learn time behavior, with mixed results. Overall, Speech ReaLLM advances real-time multimodal processing by showing that decoder-only architectures can be trained to model the flow of time, and the approach may extend to real-time applications beyond ASR.
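To make the interleaving of speech and word tokens concrete, here is a minimal sketch of how a time-aligned training target could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function name `interleave_targets`, the `<speech_i>` placeholders, the `<blank>` symbol, the fixed chunk duration, and the rule that a word is emitted after the chunk in which it ends are all hypothetical choices made for this example.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical name for the RNN-T-like BLANK symbol


def interleave_targets(
    num_chunks: int,
    chunk_sec: float,
    aligned_words: List[Tuple[str, float]],
) -> List[str]:
    """Build one interleaved target sequence.

    aligned_words holds (word, end_time_in_seconds) pairs sorted by end time,
    as an external alignment teacher might provide them (an assumption here).
    """
    seq: List[str] = []
    w = 0  # index of the next word not yet emitted
    for i in range(num_chunks):
        seq.append(f"<speech_{i}>")  # placeholder for the i-th speech chunk
        # The last chunk flushes every remaining word.
        is_last = i == num_chunks - 1
        chunk_end = float("inf") if is_last else (i + 1) * chunk_sec
        # Emit all words that finished by the end of this chunk.
        while w < len(aligned_words) and aligned_words[w][1] <= chunk_end:
            seq.append(aligned_words[w][0])
            w += 1
        # BLANK marks "nothing more to say until more audio arrives".
        seq.append(BLANK)
    return seq


example = interleave_targets(3, 1.0, [("hello", 0.8), ("world", 2.5)])
print(example)
```

In this toy example the second chunk contains no completed word, so its target is BLANK alone, which is how silence or mid-word audio would be represented under these assumptions.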
Abstract
We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the first time. The idea is inspired by RNN-T: instead of generating a response only at the end of a user prompt, the model generates a response after every input token received in real time (the response is often empty). On Librispeech "test", an 80M Speech ReaLLM achieves WERs of 3.0% and 7.4% in real time (without an external LM or auxiliary loss). These WERs are only slightly above those of a 3x larger Attention-Encoder-Decoder baseline. We also show that, trained this way, an LLM architecture can learn to represent and reproduce the flow of time, and that a pre-trained 7B LLM can be fine-tuned to do reasonably well on this task.
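The "generate after every input token" idea can be sketched as a simple streaming loop. This is a hedged illustration only: `stream_decode`, `make_scripted_model`, the `<speech_i>` placeholders, the `<blank>` symbol, and the per-chunk token cap are hypothetical names and choices, and the scripted stand-in below replaces a real decoder-only LLM so that the example runs on its own.

```python
from typing import Callable, Dict, Iterable, Iterator, List

BLANK = "<blank>"  # hypothetical name for the BLANK (empty response) symbol


def stream_decode(
    chunks: Iterable[str],
    next_token: Callable[[List[str]], str],
    max_tokens_per_chunk: int = 8,
) -> Iterator[str]:
    """After each incoming speech chunk, generate tokens until BLANK."""
    context: List[str] = []
    for chunk in chunks:
        context.append(chunk)  # new input arrives in real time
        for _ in range(max_tokens_per_chunk):  # safety cap (an assumption)
            tok = next_token(context)
            context.append(tok)
            if tok == BLANK:
                break  # nothing more to emit; wait for the next chunk
            yield tok  # hypothesis is available immediately


def make_scripted_model(script: Dict[str, List[str]]) -> Callable[[List[str]], str]:
    """Toy stand-in for an LLM: emits scripted words per chunk, then BLANK."""

    def next_token(context: List[str]) -> str:
        # Locate the most recent speech chunk in the context.
        last = max(i for i, t in enumerate(context) if t.startswith("<speech_"))
        emitted = len(context) - last - 1  # tokens generated since that chunk
        words = script.get(context[last], [])
        return words[emitted] if emitted < len(words) else BLANK

    return next_token


model = make_scripted_model({"<speech_0>": ["hello"], "<speech_2>": ["world"]})
hypothesis = list(stream_decode(["<speech_0>", "<speech_1>", "<speech_2>"], model))
print(hypothesis)
```

Because the loop never waits for an end-of-utterance signal, it can in principle run over an unbounded chunk stream, which is the property the abstract refers to as handling continuous audio without explicit end-pointing.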
