Voxtral Realtime

Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Alan Jeffares, Albert Jiang, Alexandre Sablayrolles, Amélie Héliou, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Enguerrand Paquin, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Martin, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Laurence Aitchison, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Sagar Vaze, Samuel Humeau, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Valeriia Nemychnikova, Van Phung, Vedant Nanda, Victor Jouault, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yihan Wang, Zaccharie Ramzi, Zhenlin Xu

TL;DR

Voxtral Realtime tackles the challenge of achieving offline-level transcription quality with sub-second latency in a fully streaming ASR system. It introduces a 4B-parameter, end-to-end streaming model with a causal audio encoder, a downsampling adapter, and a decoder conditioned on the target delay through an adaptive normalization mechanism (Ada RMS-Norm), which lets a single model run at different latencies. Trained with frame-synchronous targets and a delay-sampling strategy across 13 languages, the model reaches competitive performance at $\tau=480\ \mathrm{ms}$ and surpasses strong streaming baselines at higher delays. It remains practical for real-time deployment via vLLM integration and a WebSocket API. The work demonstrates that end-to-end streaming ASR can approach, and in some regimes match, offline transcription quality. The weights are released under the Apache 2.0 license to foster broader adoption and research.
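To make the delay figures concrete, the short sketch below converts a target delay into a number of decoder frames. It uses the 80 ms frame duration (12.5 Hz) stated in the Figure 2 caption; the function names and the choice to round up to whole frames are illustrative assumptions, not something the paper specifies.

```python
import math

FRAME_MS = 80  # one decoder frame covers 80 ms of audio (12.5 Hz), per the Figure 2 caption


def frame_rate_hz(frame_ms: int = FRAME_MS) -> float:
    """Frames (and therefore predicted tokens) per second of audio."""
    return 1000 / frame_ms


def delay_to_frames(delay_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Whole frames spanned by a target delay (assumption: round up)."""
    return math.ceil(delay_ms / frame_ms)


# Delays mentioned in the text and figure captions.
for delay_ms in (80, 480, 960):
    print(f"tau = {delay_ms} ms -> {delay_to_frames(delay_ms)} frame(s)")
```

Under these assumptions a 480 ms delay spans 6 frames and a 960 ms delay spans 12, so moving between the two operating points in Figure 1 changes the lookahead by six decoder steps.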

Abstract

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
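The explicit alignment between audio and text streams can be pictured as one target token per 80 ms frame, using the padding token [P] and word-boundary token [W] described in the Figure 2 caption. The following is a minimal sketch under stated assumptions, not the authors' data pipeline: the input format (word end times in milliseconds), the release rule (a word becomes available at the first frame that ends at or after its end time plus the delay), and the definition of a "group" as a run of words emitted back-to-back with no [P] in between are all hypothetical.

```python
from collections import deque

PAD, WORD = "[P]", "[W]"
FRAME_MS = 80  # one target token per 80 ms frame


def build_targets(words, delay_ms, num_frames, per_word=False):
    """Return one target token per frame.

    words: list of (end_ms, subword_tokens), sorted by end time.
    per_word=False inserts a single [W] per group; a "group" is assumed
    here to be a run of words emitted with no [P] between them.
    """
    targets, queue, next_word = [], deque(), 0
    for frame in range(num_frames):
        frame_end_ms = (frame + 1) * FRAME_MS
        # Release each word once it has ended and the delay has elapsed.
        while next_word < len(words) and words[next_word][0] + delay_ms <= frame_end_ms:
            starts_group = not queue and (not targets or targets[-1] == PAD)
            if per_word or starts_group:
                queue.append(WORD)
            queue.extend(words[next_word][1])
            next_word += 1
        # Emit the next pending token, or padding while waiting for audio.
        targets.append(queue.popleft() if queue else PAD)
    return targets


# Hypothetical word timings: "Hello" ends at 300 ms, "world" at 700 ms.
words = [(300, ["Hel", "lo"]), (700, ["world"])]
targets = build_targets(words, delay_ms=480, num_frames=18)
print(" ".join(targets))
```

Setting `per_word=True` switches to one [W] per word, which is the alternative that the Figure 4 ablation compares against.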

Paper Structure (23 sections, 2 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Voxtral Realtime approaches offline accuracy at sub-second latency. Macro-average word error-rate (WER) vs. delay on the FLEURS multilingual benchmark for realtime and offline models. Lower is better. At 480 ms delay, Voxtral Realtime is competitive with Scribe v2 Realtime, the leading realtime API model, as well as Whisper, the most popular open-source offline model. It surpasses both baselines at 960 ms delay, approaching the performance of Voxtral Mini Transcribe V2, a state-of-the-art offline transcription model.
  • Figure 2: Voxtral Realtime architecture and decoding scheme for a target delay $\tau=80$ ms. Voxtral Realtime consists of a causal audio encoder to embed the input audio stream, an MLP adapter layer to temporally downsample the audio embeddings, and a text decoder to auto-regressively generate the output text stream. The downsampled audio embeddings from the adapter and the embeddings of previously generated tokens have the same frame-rate of 12.5Hz, with each frame representing 80ms of audio. These are summed and processed by the text decoder, which predicts one token per frame. The decoder emits a padding token [P] while waiting for sufficient acoustic evidence. Once a word is acoustically complete and the target delay $\tau$ has elapsed, a word-boundary token [W] is emitted to initiate generation, followed by the corresponding subword tokens.
  • Figure 3: Ablation of delay-conditioning mechanisms. Word error-rate on three languages from the FLEURS dataset as a function of training progress. Ada RMS-Norm consistently improves convergence speed and final accuracy compared to alternative conditioning strategies.
  • Figure 4: Ablation of target construction schemes. Word error-rate on three languages from the FLEURS dataset as a function of training progress. Inserting a single word-boundary token [W] per group preserves the capabilities of the pre-trained language decoder better than inserting a [W] per word.
  • Figure 5: Voxtral streaming session via vLLM resumable requests. A session is created with an anchor request that includes the initial buffered audio (e.g., the first $\tau$ ms plus padding tokens to enforce the target delay) and runs a one-token decoder step. Each subsequent update is sent as a resumable request that appends the next 80 ms audio chunk together with the previously emitted token ID, allowing the engine to reuse cached KV states and emit the next token incrementally. This request-decode-update loop enables low-latency, continuous transcription with full-duplex streaming input and streaming output.
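
The request-decode-update loop in the Figure 5 caption can be mocked in a few lines. The sketch below is a toy stand-in, not the vLLM interface: `ToyEngine`, `ToySession`, `anchor`, `resume`, `PAD_ID`, and the dummy decode function are all hypothetical names chosen for illustration. It only shows the control flow: one anchor request carrying the buffered audio and padding tokens, followed by one resumable request per 80 ms chunk, each processing a single new position.

```python
from dataclasses import dataclass, field

PAD_ID = 0  # hypothetical ID of the padding token


@dataclass
class ToySession:
    """Engine-side state for one stream (toy model, not vLLM's)."""
    audio_chunks: list = field(default_factory=list)
    token_ids: list = field(default_factory=list)
    cached_positions: int = 0  # positions whose KV state counts as cached


class ToyEngine:
    def __init__(self, decode_fn):
        self.decode_fn = decode_fn  # maps a session to the next token ID

    def anchor(self, initial_chunks, delay_frames):
        """Open a session with buffered audio plus padding tokens."""
        session = ToySession(list(initial_chunks), [PAD_ID] * delay_frames)
        return session, self._step(session)

    def resume(self, session, chunk, prev_token_id):
        """Append one audio chunk and the previously emitted token."""
        session.audio_chunks.append(chunk)
        session.token_ids.append(prev_token_id)
        return self._step(session)

    def _step(self, session):
        # Only positions added since the last step need fresh computation.
        new_positions = len(session.token_ids) - session.cached_positions
        session.cached_positions = len(session.token_ids)
        return self.decode_fn(session), new_positions


def transcribe_stream(engine, chunks, delay_frames):
    session, (token, _) = engine.anchor(chunks[:delay_frames], delay_frames)
    emitted = [token]
    for chunk in chunks[delay_frames:]:
        token, new_positions = engine.resume(session, chunk, emitted[-1])
        assert new_positions == 1  # one incremental step per 80 ms chunk
        emitted.append(token)
    return emitted


# Dummy decoder: the "token" is just 100 plus the number of chunks seen.
engine = ToyEngine(lambda s: 100 + len(s.audio_chunks))
chunks = [f"chunk{i}" for i in range(10)]  # ten 80 ms chunks
emitted = transcribe_stream(engine, chunks, delay_frames=6)
print(emitted)
```

In this toy model the anchor request is the only step that processes more than one position; every later request touches exactly one, which is the property that lets a real engine reuse its KV cache instead of re-encoding the stream.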