Entropy-Guided GRVQ for Ultra-Low Bitrate Neural Speech Codec

Yanzhou Ren, Noboru Harada, Daiki Takeuchi, Siyu Chen, Wei Liu, Xiao Zhang, Liyuan Zhang, Takehiro Moriya, Shoji Makino

TL;DR

An entropy-guided group residual vector quantization (EG-GRVQ) is proposed for an ultra-low bitrate neural speech codec, which retains a semantic branch for linguistic information and incorporates an entropy-guided grouping strategy in the acoustic branch.

Abstract

Neural audio codecs (NACs) are essential for reconstructing high-quality speech signals and generating discrete representations for downstream speech language models. However, ensuring accurate semantic modeling while maintaining high-fidelity reconstruction under ultra-low bitrate constraints remains challenging. We propose entropy-guided group residual vector quantization (EG-GRVQ) for an ultra-low bitrate neural speech codec, which retains a semantic branch for linguistic information and incorporates an entropy-guided grouping strategy in the acoustic branch. Assuming that channel activations follow approximately Gaussian statistics, the variance of each channel can serve as a principled proxy for its information content. Based on this assumption, we partition the encoder output such that each group carries an equal share of the total information. This balanced allocation improves codebook efficiency and reduces redundancy. Trained on LibriTTS and VCTK, our model shows improvements in perceptual quality and intelligibility metrics under ultra-low bitrate conditions, with a focus on codec-level fidelity for communication-oriented scenarios.
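
To make the grouping idea concrete: for a Gaussian channel with variance σ², the differential entropy is (1/2) log(2πe σ²), which grows monotonically with the variance, so per-channel variance is a natural stand-in for information content. The sketch below is one possible reading of the "equal share" partition, not the authors' implementation. It assumes the encoder output is available as a (frames, channels) array, uses summed variance as the information measure, and only considers contiguous channel splits. The function name `equal_information_groups` and its parameters are hypothetical.

```python
import numpy as np


def equal_information_groups(latents, num_groups):
    """Split channels into contiguous groups with roughly equal summed variance.

    latents: array of shape (frames, channels), assumed encoder output.
    Returns a list of channel-index arrays, one per group.
    Assumption: variance is used directly as the information proxy.
    """
    variances = latents.var(axis=0)
    cumulative_share = np.cumsum(variances) / variances.sum()

    # Ideal cut points sit at 1/G, 2/G, ... of the total variance.
    targets = np.arange(1, num_groups) / num_groups

    # A cut after channel i leaves cumulative_share[i] on the left side.
    # For each target, pick the cut whose left-side share is closest to it.
    gaps = np.abs(cumulative_share[:-1, None] - targets[None, :])
    boundaries = gaps.argmin(axis=0) + 1

    return np.split(np.arange(latents.shape[1]), boundaries)
```

With eight channels whose variances are 6, 2, 1, 1, 1, 1, 2, 2 and two groups, this places the first two channels in one group and the remaining six in the other, so each group holds half of the total variance, whereas a uniform split would assign four channels to each. If a single channel dominates and many groups are requested, two cuts can coincide and leave a group empty, so a practical version would need to handle that case.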

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Structure of the proposed model.
  • Figure 2: Quantizer structure configuration for (a) RVQ, (b) GRVQ, and (c) EG-GRVQ (Proposal).
  • Figure 3: Codebook utilization rate in acoustic branch.
  • Figure 4: MUSHRA score distributions of different methods.
  • Figure 5: Mean MUSHRA score differences between Proposal (EG-GRVQ) and baselines with 95% confidence intervals.