Modeling strategies for speech enhancement in the latent space of a neural audio codec

Sofiene Kammoun; Xavier Alameda-Pineda; Simon Leglaive

TL;DR

This study systematically compares speech enhancement (SE) models that operate in the latent space of a neural audio codec (NAC), contrasting continuous latent representations with discrete tokens and autoregressive with non-autoregressive architectures. Using Conformer-based models and the Descript Audio Codec on Libri1Mix data, the authors show that predicting continuous representations consistently yields higher enhancement quality than predicting discrete tokens, while autoregressive models achieve higher quality at the expense of intelligibility and efficiency. Fine-tuning the NAC encoder is a simple but strong baseline: it achieves the best enhancement metrics overall, though at the cost of degraded codec reconstruction. The findings offer practical guidance for choosing latent representations and modeling strategies in NAC-based SE, with implications for telecommunication systems and codec-compatible deployments.

Abstract

Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline in which the NAC encoder is fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
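To make the distinction between the two kinds of training target concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a mean-squared-error loss for continuous latent targets and a cross-entropy loss for discrete token targets, and it uses a single toy codebook with made-up values, whereas an actual codec may use several quantizer stages. All function names and numbers are hypothetical.

```python
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))

def mse_loss(pred, target):
    """Continuous target (assumed loss): mean squared error over all latent entries."""
    n = sum(len(frame) for frame in target)
    return sum(
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    ) / n

def cross_entropy_loss(logits, target_ids):
    """Discrete target (assumed loss): mean negative log-likelihood of the clean tokens."""
    total = 0.0
    for frame_logits, idx in zip(logits, target_ids):
        m = max(frame_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in frame_logits))
        total += log_z - frame_logits[idx]
    return total / len(target_ids)

# Hypothetical toy data: a 3-entry codebook of 2-D vectors and two clean latent frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
clean_latents = [[0.9, 0.1], [0.1, 0.8]]

# Discrete targets are the indices of the nearest codebook entries.
clean_tokens = [nearest_code(z, codebook) for z in clean_latents]

# Hypothetical model outputs for the two modeling options.
pred_latents = [[1.0, 0.0], [0.0, 1.0]]          # regression onto continuous latents
pred_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]  # classification over codebook entries

continuous_loss = mse_loss(pred_latents, clean_latents)
discrete_loss = cross_entropy_loss(pred_logits, clean_tokens)

print("clean tokens:", clean_tokens)
print("continuous (MSE) loss:", continuous_loss)
print("discrete (cross-entropy) loss:", discrete_loss)
```

In this sketch the continuous target keeps the information that quantization discards (the distance between each latent frame and its nearest codebook entry), while the discrete target reduces each frame to a class label.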
