VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
TL;DR
VocalNet introduces a pair of speech LLMs (VocalNet-1B and VocalNet-8B) that use multi-token prediction (MTP) to accelerate speech generation while maintaining high quality. The architecture combines a speech encoder, an LLM backbone, and a speech decoder, and relies on a two-stage training strategy and streaming attention masks to enable real-time interaction. The MTP design uses sequential transformer modules and multiple output heads, trained with a decaying loss, to capture local speech patterns and mitigate error accumulation; it outperforms group-modeling approaches at high speedups. Experiments on OpenAudioBench show VocalNet achieving strong modality alignment and acoustic quality with relatively modest training data, and the public release of model weights, code, and training data supports reproducibility. Overall, the paper demonstrates that MTP can reduce latency and improve quality in speech LLMs at the same time, enabling practical, low-latency voice interaction.
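To make the "multiple output heads with a decaying loss" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `mtp_loss`, the geometric weighting `decay ** k`, the default `decay=0.5`, and the per-head averaging are all hypothetical choices. The only point it shows is that head k predicts the token k+1 steps ahead and that later heads contribute less to the total loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mtp_loss(head_logits, tokens, decay=0.5):
    """Weighted sum of per-head cross-entropy losses.

    head_logits[k][t] holds the logits that head k emits at position t;
    its target is tokens[t + k + 1]. The weight decay ** k is an assumed
    schedule chosen for illustration, not a value taken from the paper.
    """
    total = 0.0
    for k, per_position in enumerate(head_logits):
        offset = k + 1  # head 0 predicts the next token, head 1 the one after
        losses = []
        for t, logits in enumerate(per_position):
            target_index = t + offset
            if target_index >= len(tokens):
                break  # no ground-truth token this far ahead
            losses.append(-log_softmax(logits)[tokens[target_index]])
        if losses:
            total += (decay ** k) * sum(losses) / len(losses)
    return total


# Two heads, four positions, vocabulary of size 4, uniform logits.
example_logits = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
example_tokens = [0, 1, 2, 3]
example_loss = mtp_loss(example_logits, example_tokens, decay=0.5)
```

With uniform logits every prediction costs log(4), so the example reduces to the sum of the head weights times log(4). That makes it easy to see how the decay factor shrinks the contribution of heads that look further ahead.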
Abstract
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet
