What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui

TL;DR

This work systematically analyzes how speech tokenizer design affects cross-modal alignment and speech generation in LLM-centric speech-language models (SLMs). It shows that fully decoupled tokenizers paired with multi-token prediction (MTP) significantly improve alignment and synthesis efficiency, achieving up to $12\times$ faster decoding and lowering word error rate (WER) from 6.07 to 3.01. Speaker-aware generation and a role-based QA benchmark (RoleTriviaQA) yield improved speaker timbre control and knowledge-grounded responses, with decoupled tokenizers delivering state-of-the-art performance on ID and robust generalization to OOD. Together, these findings offer practical guidance for building scalable, knowledge-rich SLMs with controllable speaker characteristics.

Abstract

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
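The MTP idea in the abstract, where each hidden state decodes several speech tokens instead of one, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `decode_steps` and `mtp_step`, the plain linear heads, greedy argmax selection, and all sizes are assumptions chosen for illustration only.

```python
import math
import random


def decode_steps(num_speech_tokens, tokens_per_step=1):
    """Autoregressive steps needed to emit a speech-token sequence.

    tokens_per_step=1 corresponds to next-token prediction (NTP);
    tokens_per_step=H corresponds to MTP with H speech heads.
    """
    return math.ceil(num_speech_tokens / tokens_per_step)


def mtp_step(hidden, heads):
    """Decode one speech token per head from a single hidden state.

    hidden: list of d floats (one LLM hidden state).
    heads:  list of H weight matrices; each matrix has one row of d
            floats per entry in that head's vocabulary (hypothetical layout).
    Returns a list of H token ids, chosen greedily (an assumption).
    """
    tokens = []
    for weights in heads:
        # Linear projection of the shared hidden state onto this head's vocabulary.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


if __name__ == "__main__":
    random.seed(0)
    d, vocab_size, num_heads = 8, 16, 12  # illustrative sizes only
    hidden = [random.gauss(0, 1) for _ in range(d)]
    heads = [
        [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab_size)]
        for _ in range(num_heads)
    ]
    print(len(mtp_step(hidden, heads)), "speech tokens from one hidden state")
    print(decode_steps(600), "NTP steps vs", decode_steps(600, num_heads), "MTP steps")
```

In this sketch, 12 heads reduce the number of decoding steps for a 600-token sequence from 600 to 50. That reduction in step count is in line with the reported speedup of up to $12\times$, although measured speedups also depend on factors the sketch ignores.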

Paper Structure

This paper contains 67 sections, 8 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Left: Overview of a Speech Language Model (SLM) trained with a decoupled speech tokenizer (Section \ref{ssec:slm}) and Speaker-Aware TTS (Section \ref{ssec:sa_tts}); Right: The architecture of a possible decoupled speech tokenizer, featuring speech quantization, reconstruction in a decoupled manner, and speaker-specified embedding extraction.
  • Figure 2: Illustration of our NTP and MTP architecture. (a) NTP: single vocabulary and single prediction head; (b) MTP: multiple vocabularies and multiple prediction heads, generating multiple tokens in parallel.
  • Figure 3: Illustration of Role-Playing Knowledge QA.
  • Figure 4: Training loss comparison between SLMs trained with baselines and FACodec across TTS training (upper figure) and 2-stage training for Role-Playing Knowledge QA task (middle and lower figures).
  • Figure 5: UMAP visualization of speech and text embedding distributions at the word embeddings, the middle layer, and the last layer of MTP models with different numbers of speech heads (3H, 6H, and 12H), illustrating how the relative distance between modalities changes as the number of speech heads increases.