Language-Codec: Bridging Discrete Codec Representations and Speech Language Models

Shengpeng Ji, Minghui Fang, Jialong Zuo, Ziyue Jiang, Dingdong Wang, Hanting Wang, Hai Huang, Zhou Zhao

TL;DR

Language-Codec aims to bridge discrete codec representations and downstream speech language models by reducing the excess information carried by the first quantizer through Masked Channel Residual Vector Quantization (MCRVQ). It introduces a Fourier-transform–based decoder and multi-scale discriminators to improve reconstruction quality while keeping a compact, four-channel codebook for efficient downstream conditioning. Empirical results show state-of-the-art reconstruction across multiple datasets and notable gains in zero-shot TTS speaker similarity and quality, with strong generalization to out-of-domain data. The work positions Language-Codec as a foundational discrete codec, designed with speech language models in mind, for future speech generation research.

Abstract

In recent years, large language models have achieved significant success in generative tasks related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codec, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) due to the reconstruction paradigm of the codec model and the structure of residual vector quantization, the initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks; 2) the large number of codebooks increases the burden on downstream speech language models. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In Language-Codec, we introduce a Masked Channel Residual Vector Quantization (MCRVQ) mechanism, along with improved Fourier transform structures, attention blocks, and a refined discriminator design, to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe that it significantly outperforms them across extensive evaluations. Furthermore, we also validate the efficiency of Language-Codec on downstream speech language models. The source code and pre-trained models can be accessed at https://github.com/jishengpeng/languagecodec .
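
To make the first gap concrete, the following is a minimal sketch of plain residual vector quantization, not the MCRVQ mechanism proposed in the paper. It assumes random, untrained codebooks with a shrinking scale per stage, and all names (`make_codebooks`, `rvq_encode`, `rvq_decode`) are hypothetical and not taken from the Language-Codec repository. It shows that each stage only quantizes what the earlier stages left over, so the first channel tends to absorb most of the signal.

```python
import random


def make_codebooks(num_stages=4, size=16, dim=8, seed=0):
    """Build toy codebooks (assumption: random, untrained, scale halves per stage)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        # A zero codeword guarantees that a stage can never increase the residual.
        codebook = [[0.0] * dim]
        codebook += [[rng.gauss(0.0, scale) for _ in range(dim)]
                     for _ in range(size - 1)]
        codebooks.append(codebook)
    return codebooks


def nearest(codebook, vector):
    """Return the index of the codeword closest to `vector` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage sees the residual of the previous ones."""
    residual = list(vector)
    indices, residual_energy = [], []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
        indices.append(idx)
        residual_energy.append(sum(r * r for r in residual))
    return indices, residual_energy


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    codebooks = make_codebooks()
    rng = random.Random(1)
    frame = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for one latent frame
    indices, energy = rvq_encode(frame, codebooks)
    print("token per channel:", indices)
    print("residual energy after each stage:", energy)
```

The printed residual energies are non-increasing by construction. In a trained codec the first stage typically removes the largest share, which is the imbalance the abstract attributes to the initial channel.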

Paper Structure (24 sections, 13 equations, 2 figures, 10 tables)

This paper contains 24 sections, 13 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: The overall architecture of Language-Codec. On the far left is the encoder downsampling module, which retains the model structure of Encodec. On the far right is the decoder upsampling module, which we have replaced with the model structure of Vocos. The middle part is the Masked Channel Residual Vector Quantization (MCRVQ) module, with the gray blocks indicating the masked portion of temporal information. The dashed lines within the MCRVQ module indicate that the corresponding representations exhibit a decrease in residual values.
  • Figure 2: The overall architecture of the Attention Block and ConvNeXt Blocks inside the decoder. Subfigures (b) and (c) show the more fundamental structures within the Attention Block. The text surrounded by “<>” indicates the parameter settings of Conv1d.