High-Fidelity Speech Enhancement via Discrete Audio Tokens

Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer

TL;DR

This work addresses high-fidelity speech enhancement and bandwidth extension without multi-stage pipelines by leveraging discrete 44.1 kHz DAC tokens in a single autoregressive LM. It introduces DAC-SE1, a 1B-parameter LLaMA-based model that ingests flattened DAC token streams and yields clean, bandwidth-extended speech. The paper demonstrates state-of-the-art objective metrics and strong MUSHRA scores on PLC and DNS benchmarks, outperforming prior LM-based SE baselines. By releasing code and checkpoints, it supports scalable, unified high-quality SE research.

Abstract

Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Qualitative comparison on log-mel spectrograms between our proposed method (DAC-SE1) and previous autoregressive speech enhancement methods. DAC-SE1 is able to clean the signal without hallucinating artifacts or introducing spectral distortion.
  • Figure 2: Overview of the DAC-SE1 framework for high-fidelity speech enhancement and bandwidth extension. Previous work mostly uses a continuous speech representation (e.g., HuBERT or WavLM) as the input to the autoregressive model and then predicts tokens from a Neural Speech Codec (NSC). These models are limited to 16 kHz signals. Our approach does not require semantic representations and only leverages the compressed representation of Neural Audio Codecs (NAC). We use the DAC model, which compresses a 44.1 kHz signal into 9 codebook layers at an 86 Hz frame rate. We flatten this sequence into $9\cdot86 = 774$ tokens per second, which our LLaMA-based model translates into clean speech in the DAC token space; the DAC decoder then reconstructs the waveform from these tokens.
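
The flattening step described in the Figure 2 caption can be illustrated with a minimal sketch. The caption gives only the number of codebook layers (9) and the frame rate (86 Hz); the frame-major interleaving order, the per-codebook vocabulary offsets, the codebook size of 1024, and the helper names `flatten_codes` and `unflatten_tokens` are all assumptions made for illustration and may differ from the released DAC-SE1 code.

```python
import numpy as np

NUM_CODEBOOKS = 9      # from the Figure 2 caption
FRAME_RATE_HZ = 86     # from the Figure 2 caption
CODEBOOK_SIZE = 1024   # assumption: not stated in this section


def flatten_codes(codes: np.ndarray) -> np.ndarray:
    """Flatten (num_codebooks, num_frames) codes into one 1-D token stream.

    Assumed layout: frame-major, i.e. all codebook layers of frame 0,
    then all layers of frame 1, and so on. Each layer is shifted into
    its own vocabulary range so the LM can tell the layers apart.
    """
    num_codebooks, _ = codes.shape
    offsets = np.arange(num_codebooks)[:, None] * CODEBOOK_SIZE
    return (codes + offsets).T.reshape(-1)


def unflatten_tokens(tokens: np.ndarray,
                     num_codebooks: int = NUM_CODEBOOKS) -> np.ndarray:
    """Invert flatten_codes: recover (num_codebooks, num_frames) codes."""
    frames = tokens.reshape(-1, num_codebooks).T
    offsets = np.arange(num_codebooks)[:, None] * CODEBOOK_SIZE
    return frames - offsets


# One second of random stand-in codes (not real DAC output).
rng = np.random.default_rng(0)
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_CODEBOOKS, FRAME_RATE_HZ))

tokens = flatten_codes(codes)
restored = unflatten_tokens(tokens)

print(tokens.shape)                     # one second -> 9 * 86 tokens
print(np.array_equal(restored, codes))  # round trip is lossless
```

Under these assumptions, one second of audio becomes a sequence of 774 tokens, and the inverse mapping recovers the per-layer codes that a codec decoder would consume.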