Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu

TL;DR

The paper addresses the problem that neural speech codecs struggle in complex acoustic environments and can degrade downstream processing. It introduces the Environment-Resilient Speech Codec Benchmark (ERSB), which pairs a data simulation pipeline with speech enhancement (SE) and automatic speech recognition (ASR) backends to evaluate reconstruction quality and downstream task consistency. The pipeline builds mixtures of the form $M(t)=I(t)*S(t)+N(t)$ and, with level scaling, $M(t)=10^{l/20}\left(I(t)*S(t)+10^{-\mu/20}N(t)\right)$. Key contributions include the ERSB framework itself, a systematic evaluation across multiple codecs (e.g., DAC, EnCodec, SemantiCodec, SpeechTokenizer, X-Codec) and real-world CHiME/DNS data, and the finding that no codec simultaneously achieves robust reconstruction and task consistency in noisy environments: DAC performs best for reconstruction but lacks downstream reliability. The findings highlight a critical gap in codec design for real-world deployment and suggest future research toward environment-resilient codecs that preserve the information SE and ASR need under complex acoustics.
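To make the scaled mixture formula concrete, here is a minimal NumPy sketch. It rests on assumptions the summary does not state: $\mu$ is read as an SNR in dB, $l$ as a global loudness gain in dB, $*$ as convolution with an impulse response $I$, and the noise is rescaled to the power of the reverberant speech so that the $10^{-\mu/20}$ factor alone sets the SNR. The function and parameter names are hypothetical and are not taken from the paper's pipeline.

```python
import numpy as np


def simulate_mixture(speech, rir, noise, snr_db, loudness_db):
    """Sketch of M = 10^(l/20) * (I*S + 10^(-mu/20) * N), '*' being convolution.

    Assumes `noise` is at least as long as `speech` and is not silent.
    """
    # Reverberant speech: convolve with the impulse response, keep the input length.
    reverberant = np.convolve(speech, rir)[: len(speech)]

    # Assumption: match the noise power to the reverberant speech power,
    # so that scaling by 10^(-mu/20) yields an SNR of exactly mu dB.
    noise = noise[: len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    noise = noise * np.sqrt(speech_power / noise_power)

    mixture = reverberant + 10.0 ** (-snr_db / 20.0) * noise

    # Global gain of l dB applied to the whole mixture.
    return 10.0 ** (loudness_db / 20.0) * mixture
```

Sweeping `snr_db` and `loudness_db` over a grid with a function like this is one way to produce the kind of conditions that Figure 2 plots PESQ/STOI against.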

Abstract

Neural speech codecs excel at reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two key capabilities: (1) robust reconstruction, which measures the preservation of both speech and non-speech acoustic details, and (2) downstream task consistency, which requires minimal deviation in downstream signal processing tasks when reconstructed speech is used instead of the original. Our comprehensive experiments reveal that complex acoustic environments significantly degrade signal reconstruction and downstream task consistency. This work highlights the limitations of current speech codecs and points to improving their environmental resilience as a direction for future work.
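Downstream task consistency can be illustrated with a small sketch for the SE case. The caption of Figure 3 names $\Delta$SI-SDR as the measure, but this summary does not give its exact definition, so the `delta_si_sdr` function below is one plausible reading rather than the authors' formula: it is the change in SI-SDR of the enhanced output when the SE backend receives codec-reconstructed audio instead of the original. The `codec` and `enhance` callables are hypothetical stand-ins for a codec's encode/decode round trip and an SE model.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))


def delta_si_sdr(clean, noisy, codec, enhance):
    """Assumed consistency score: SI-SDR with the codec in the loop minus SI-SDR without it."""
    baseline = si_sdr(enhance(noisy), clean)
    with_codec = si_sdr(enhance(codec(noisy)), clean)
    return with_codec - baseline
```

Under this reading, a value near 0 dB means the SE backend behaves the same on reconstructed audio as on the original, and a strongly negative value means the codec discarded information the backend relied on.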

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of ERSB benchmark framework.
  • Figure 2: PESQ/STOI values with respect to SNR and loudness.
  • Figure 3: Consistency of the SE backend on the simulated dataset, measured by $\Delta$SI-SDR.