Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

Jiaqi Li, Dongmei Wang, Xiaofei Wang, Yao Qian, Long Zhou, Shujie Liu, Midia Yousefi, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yanqing Liu, Junkun Chen, Sheng Zhao, Jinyu Li, Zhizheng Wu, Michael Zeng

TL;DR

This work systematically analyzes how neural audio codec tokens influence speech generation within speech language models by retraining three high-performing codecs (Encodec, Vocos, DAC) under a unified setup and integrating them into two SLM frameworks (masked-based parallel generation and AR+NAR generation). It reveals that superior waveform reconstruction does not guarantee better generation, and emphasizes that a high-quality codec decoder is key for naturalness while quantization choices more strongly affect intelligibility. DAC-based codecs emerge as the most effective overall for SLM-based speech generation, with Vocos offering competitive decoding performance, and the study provides concrete design guidance for codec selection and integration to optimize generation quality. The results highlight nuanced trade-offs between reconstruction fidelity, speaker similarity, and intelligibility, informing practical codec design for scalable, token-based speech generation systems.

Abstract

Neural audio codec tokens serve as the fundamental building blocks for speech language model (SLM)-based speech generation. However, there is no systematic understanding of how the codec system affects the speech generation performance of the SLM. In this work, we examine codec tokens within the SLM framework for speech generation to provide insights for effective codec design. We retrain existing high-performing neural codec models with the same dataset and loss functions to compare their performance in a uniform setting. We integrate codec tokens into two SLM systems: a masked-based parallel speech generation system and an auto-regressive (AR) plus non-auto-regressive (NAR) model-based system. Our findings indicate that better speech reconstruction in codec systems does not guarantee improved speech generation in the SLM. A high-quality codec decoder is crucial for natural speech production in the SLM, while speech intelligibility depends more on the quantization mechanism.

Paper Structure (23 sections, 4 figures, 4 tables)

This paper contains 23 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: High-level architecture of neural audio codecs.
  • Figure 2: Model architecture of masked-based parallel speech generation.
  • Figure 3: AR + NAR speech generation.
  • Figure 4: The log-scale distribution of 1st layer codec tokens.