VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

TL;DR

VocalNet introduces a pair of speech LLMs (VocalNet-1B and VocalNet-8B) that use multi-token prediction (MTP) to accelerate speech generation while maintaining high quality. The architecture combines a speech encoder, an LLM backbone, and a speech decoder, and relies on a two-stage training strategy and streaming attention masks to enable real-time interaction. The MTP design uses sequential transformer modules and multiple output heads with a decaying loss to capture local speech patterns and mitigate error accumulation, outperforming group-modeling approaches at high speedups. Experiments on OpenAudioBench show VocalNet achieving strong modality alignment and acoustic quality with relatively modest training data, and the public release of model weights, inference code, training data, and framework implementations supports reproducibility. Overall, the paper demonstrates that MTP can reduce latency and improve generation quality in speech LLMs at the same time, enabling practical, low-latency voice interaction.
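To make the "multiple output heads with a decaying loss" idea concrete, here is a minimal pure-Python sketch of how per-head losses might be combined at a single position. The geometric weighting `decay ** k`, the function names, and the default `decay=0.5` are assumptions for illustration; this summary does not state the exact weighting used in the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_decayed_loss(head_logits, targets, decay=0.5):
    """Weighted sum of per-head losses at one sequence position.

    head_logits[k] holds the logits of head k, which predicts the speech
    token k + 1 steps ahead; targets[k] is the matching ground-truth id.
    The geometric weight decay ** k is an assumed form (hypothetical),
    chosen so that heads predicting further ahead contribute less.
    """
    if len(head_logits) != len(targets):
        raise ValueError("need one target per head")
    return sum(
        (decay ** k) * cross_entropy(logits, target)
        for k, (logits, target) in enumerate(zip(head_logits, targets))
    )


if __name__ == "__main__":
    # Two heads over a 4-token vocabulary with uniform logits.
    print(mtp_decayed_loss([[0.0] * 4, [0.0] * 4], [1, 2], decay=0.5))
```

Down-weighting the later heads reflects the intuition that tokens further in the future are harder to predict, so their loss should not dominate training.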

Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet

Paper Structure

This paper contains 32 sections, 7 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: On the left: The architecture of the VocalNet model. On the right: A depiction of VocalNet's dual-stage training strategy.
  • Figure 2: (a) Non-Streaming Attention Mask: $\boldsymbol{v}_{LLM}^i$ attends to itself and all text positions, and $\boldsymbol{s}^i$ attends to itself, all text positions, and its previous speech positions; (b) Streaming Attention Mask: $\boldsymbol{v}_{LLM}^i$ attends to itself and its previous text positions, and $\boldsymbol{s}^i$ attends to itself, chunk-limited text positions, and its previous speech positions.
  • Figure 3: Distribution of maximum probabilities and entropy values for 70k predicted speech tokens from VocalNet-1B, trained with the NTP task. Red dashed lines represent the means.
  • Figure 4: Illustration of various acceleration implementations. (a): Group Modeling; (b): MTP-Parallel-Linear; (c): MTP-DeepSeek; (d): Our MTP implementation.
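The two attention masks described in the Figure 2 caption can be sketched as boolean matrices. The sketch below assumes a layout with all text positions first and speech positions after them, and it assumes one particular chunking rule: speech position `k` sees the first `text_chunk * (k // speech_chunk + 1)` text positions. The function name, the parameters `text_chunk` and `speech_chunk`, and this rule are hypothetical; the caption only says the text positions are "chunk-limited".

```python
def build_mask(num_text, num_speech, streaming=False, text_chunk=2, speech_chunk=4):
    """Return mask[i][j] == True when position i may attend to position j.

    Rows and columns 0..num_text-1 are text positions; the rest are speech.
    """
    n = num_text + num_speech
    mask = [[False] * n for _ in range(n)]

    # Text rows: all text positions (non-streaming), or only the position
    # itself and earlier text positions (streaming).
    for i in range(num_text):
        visible = i + 1 if streaming else num_text
        for j in range(visible):
            mask[i][j] = True

    # Speech rows: a prefix of the text plus itself and earlier speech.
    for k in range(num_speech):
        row = num_text + k
        if streaming:
            # Assumed chunking rule (hypothetical, see lead-in).
            visible_text = min(num_text, text_chunk * (k // speech_chunk + 1))
        else:
            visible_text = num_text
        for j in range(visible_text):
            mask[row][j] = True
        for j in range(num_text, row + 1):
            mask[row][j] = True
    return mask


if __name__ == "__main__":
    for row in build_mask(3, 2, streaming=True, text_chunk=1, speech_chunk=1):
        print("".join("x" if allowed else "." for allowed in row))
```

In the streaming case the speech rows can be computed before the full text is available, which is what allows speech generation to start while the LLM is still producing text.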