Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

Haohan Guo, Fenglong Xie, Dongchao Yang, Xixin Wu, Helen Meng

TL;DR

The paper tackles recency bias in neural codec language model (CLM) based TTS, which it attributes to long discrete speech token sequences. It introduces CoFi-Speech, a coarse-to-fine framework that combines a multi-scale speech codec (CoFi-Codec) with an LM-based generator (CoFi-LM) operating in one of two modes: chain-of-scale, which uses a single LM, and stack-of-scale (SoS), which uses multiple LMs. The authors report that multi-scale coding and hierarchical generation improve naturalness and speaker similarity over single-scale baselines, with SoS achieving the best results on zero-shot TTS on a Chinese dataset, and that attention analyses show improved context modeling. Key components include scale-wise nested dropout (SWND) to prevent high-scale collapse, GAN-based training for speech reconstruction, and in-context learning for zero-shot prompts. The results suggest that explicitly modeling information at multiple temporal scales improves the stability and quality of CLM-based TTS, which may in turn benefit robustness on out-of-domain inputs.

Abstract

The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often producing unnatural or even unintelligible speech. This work proposes CoFi-Speech, a coarse-to-fine CLM-TTS approach, employing multi-scale speech coding and generation to address this issue. We train a multi-scale neural codec, CoFi-Codec, to encode speech into a multi-scale discrete representation, comprising multiple token sequences with different time resolutions. Then, we propose CoFi-LM that can generate this representation in two modes: the single-LM-based chain-of-scale generation and the multiple-LM-based stack-of-scale generation. In experiments, CoFi-Speech significantly outperforms single-scale baseline systems on naturalness and speaker similarity in zero-shot TTS. The analysis of multi-scale coding demonstrates the effectiveness of CoFi-Codec in learning multi-scale discrete speech representations while keeping high-quality speech reconstruction. The coarse-to-fine multi-scale generation, especially for the stack-of-scale approach, is also validated as a crucial approach in pursuing a high-quality neural codec language model for TTS.
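
To make the two generation modes in the abstract concrete, here is a minimal sketch of how a three-scale token representation might be laid out for each. It is an illustration only, not the authors' implementation: the token rates in `SCALE_RATES_HZ`, the function names, and the choice that each stack-of-scale LM conditions on all coarser scales are assumptions that do not come from the paper.

```python
from typing import Dict, List

# Hypothetical token rates (tokens per second) for a three-scale codec,
# ordered coarse to fine. These values are illustrative assumptions.
SCALE_RATES_HZ: List[int] = [5, 25, 50]


def tokens_per_scale(duration_s: float, rates_hz: List[int]) -> List[int]:
    """Number of tokens each scale emits for an utterance of this duration."""
    return [int(duration_s * rate) for rate in rates_hz]


def chain_of_scale_layout(lengths: List[int]) -> List[str]:
    """Flatten all scales into one sequence, coarsest first.

    This mirrors the single-LM mode: one model sees every scale in a
    single coarse-to-fine token stream.
    """
    layout: List[str] = []
    for scale, n_tokens in enumerate(lengths, start=1):
        layout.extend(f"s{scale}_t{i}" for i in range(n_tokens))
    return layout


def stack_of_scale_conditioning(num_scales: int) -> Dict[int, List[int]]:
    """Map each scale's LM to the coarser scales it conditions on.

    This mirrors the multiple-LM mode. Conditioning on *all* coarser
    scales (rather than only the adjacent one) is an assumption here.
    """
    return {scale: list(range(1, scale)) for scale in range(1, num_scales + 1)}


lengths = tokens_per_scale(2.0, SCALE_RATES_HZ)
print(lengths)                              # tokens per scale for a 2 s utterance
print(len(chain_of_scale_layout(lengths)))  # length of the single flattened sequence
print(stack_of_scale_conditioning(3))       # per-LM conditioning in the stacked mode
```

Under these assumed rates, the coarsest sequence is much shorter than the finest one, so a model generating it can cover the whole utterance in far fewer steps. That is the intuition behind generating coarse scales first.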

Paper Structure

This paper contains 13 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: The system architecture of a three-scale CoFi-Speech, comprising: a CoFi-Codec for speech encoding and decoding, and two types of CoFi-LM for text-to-speech generation. The trainable modules are colored red, and each VQ operation comprises a trainable codebook.
  • Figure 2: The objective evaluation of multi-scale coding with or without scale-wise nested dropout (SWND).
  • Figure 3: The objective evaluation of multi-scale generation.
  • Figure 4: The aggregated attention maps of different LMs. For CoFi-Speech (SoS), maps displayed from left to right refer to $\text{GPT}_3$, $\text{GPT}_2$, and $\text{GPT}_1$.