Voxtral

Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devendra Singh Chaplot, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gabrielle Berrada, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jason Rute, Jean-Hadrien Chabran, Jessica Chudnovsky, Joachim Studnia, Joep Barmentlo, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Lélio Renard Lavaud, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Matthieu Dinot, Maxime Darrin, Maximilian Augustin, Mickaël Seznec, Neha Gupta, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Rémi Delacourt, Romain Sauvestre, Roman Soletskyi, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Shashwat Dalal, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Tom Bewley, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yihan Wan, Yunhao Tang

TL;DR

Voxtral presents two open-weights multimodal models (Mini and Small) that jointly process speech and text with a 32K context window, enabling long audio handling and strong transcription/translation performance. The architecture combines a Whisper-based audio encoder, a 4x downsampling adapter, and a transformer-based language decoder, with two decoder backbones that trade off compute against accuracy. Training proceeds in three phases: pretraining with two audio-text patterns, supervised finetuning for transcription and understanding tasks, and preference alignment via (online) DPO. The work also introduces new speech understanding benchmarks, including speech-synthesized versions of established QA datasets. Results show state-of-the-art transcription and translation among open-weights models, competitive speech understanding, and robust text capabilities. Both models are released under the Apache 2.0 license to enable local deployment and broad adoption.

Abstract

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models while being small enough to run locally. A 32K context window enables the model to handle audio files up to 40 minutes in duration and long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.
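
The link between the 32K context window and the 40-minute audio limit can be checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions not stated in this summary: that the Whisper-based encoder emits 50 frames per second (1500 frames per 30-second chunk, as in the original Whisper design), that the last chunk is padded to a full 30 seconds, and that "32K" means 32,768 tokens. The 30-second chunking and the 4x downsampling come from the text; the function and constant names are hypothetical.

```python
import math

# Assumption: Whisper-style encoder output rate (1500 frames per 30 s chunk).
ENCODER_FRAMES_PER_SECOND = 50
# From the text: audio is processed in independent 30-second chunks.
CHUNK_SECONDS = 30
# From the text: the adapter downsamples audio embeddings by 4x.
DOWNSAMPLE_FACTOR = 4
# Assumption: "32K" is read as 32 * 1024 tokens.
CONTEXT_WINDOW = 32 * 1024


def audio_tokens(duration_seconds: float) -> int:
    """Estimate decoder tokens consumed by audio of the given duration.

    Assumes the final chunk is padded to a full 30 seconds.
    """
    num_chunks = math.ceil(duration_seconds / CHUNK_SECONDS)
    frames = num_chunks * CHUNK_SECONDS * ENCODER_FRAMES_PER_SECOND
    return frames // DOWNSAMPLE_FACTOR


forty_minutes = 40 * 60
used = audio_tokens(forty_minutes)
remaining = CONTEXT_WINDOW - used
print(f"audio tokens: {used}, left for text: {remaining}")
```

Under these assumptions, 40 minutes of audio takes up most of the context window and leaves a small remainder for the text prompt and the response, which is consistent with the stated limit.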

Paper Structure

This paper contains 30 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Voxtral Architecture. The audio encoder processes the speech input, attending to 30-second chunks of audio independently. The audio embeddings are concatenated at the output, and downsampled by a factor of 4x in the audio-language adapter. The multimodal LLM decoder auto-regressively predicts text tokens, conditional on the audio and text inputs.
  • Figure 2: Pretraining patterns. A single audio-text example $(A,T)$ is first segmented into a set of audio-text pairs $\left\{(A_n, T_n)\right\}_{n=1}^{N}$, based on the timestamps and transcriptions returned by the segmentation stage. For the audio-to-text repetition pattern, a given audio $A_n$ is repeated in the text space as $T_n$. For the cross-modal continuation pattern, each audio $A_n$ is followed by its subsequent text $T_{n+1}$. The task is signaled to the model by the <repeat> and <next> special tokens, respectively.
  • Figure 3: Speech Recognition Benchmarks. Macro-average WER results across tasks. Voxtral Small outperforms all open and closed-source models on English Short-Form and MCV. Voxtral Mini Transcribe beats GPT-4o mini Transcribe and Gemini 2.5 Flash in every task.
  • Figure 4: FLEURS Translation. BLEU scores for source/target language pairs on the FLEURS Translation benchmark. Voxtral Small achieves state-of-the-art for every combination of languages.
  • Figure 5: Speech Understanding Benchmarks. We report the accuracy across three speech understanding benchmarks and three synthesized speech subsets of text benchmarks. Voxtral Small is competitive with closed-source models, surpassing GPT-4o mini Audio on three of the seven benchmarks.
  • ...and 4 more figures
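
The two pretraining patterns in the Figure 2 caption can be sketched as a small pairing routine. This is a minimal illustration, not the authors' data pipeline: the `Segment` class and both function names are hypothetical, audio is represented by placeholder strings, and the sketch only forms (audio, special token, text) triples. How such triples are concatenated into full training sequences is not specified in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Special tokens named in the Figure 2 caption.
REPEAT = "<repeat>"
NEXT = "<next>"


@dataclass(frozen=True)
class Segment:
    """One aligned pair (A_n, T_n); `audio` is a placeholder for audio features."""
    audio: str
    text: str


Triple = Tuple[str, str, str]


def repetition_triples(segments: List[Segment]) -> List[Triple]:
    """Audio-to-text repetition: A_n is followed by its own transcription T_n."""
    return [(s.audio, REPEAT, s.text) for s in segments]


def continuation_triples(segments: List[Segment]) -> List[Triple]:
    """Cross-modal continuation: A_n is followed by the next text T_{n+1}."""
    return [(cur.audio, NEXT, nxt.text) for cur, nxt in zip(segments, segments[1:])]


segments = [Segment("A1", "T1"), Segment("A2", "T2"), Segment("A3", "T3")]
print(repetition_triples(segments))
print(continuation_triples(segments))
```

Note that the continuation pattern yields one fewer triple than the repetition pattern, since the last audio segment has no subsequent text.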