High-Fidelity Speech Enhancement via Discrete Audio Tokens
Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer
TL;DR
This work targets high-fidelity speech enhancement (SE) and bandwidth extension without multi-stage pipelines, using discrete 44.1 kHz DAC tokens in a single autoregressive language model (LM). It introduces DAC-SE1, a 1B-parameter LLaMA-based model that takes flattened DAC token streams as input and generates clean, bandwidth-extended speech. The paper reports state-of-the-art objective metrics and strong MUSHRA scores on PLC and DNS benchmarks, outperforming prior LM-based SE baselines. The released code and checkpoints support further research in scalable, unified, high-quality SE.
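To make "flattened DAC token streams" concrete, the sketch below shows one plausible way to turn a multi-codebook token grid into a single sequence for an autoregressive LM, and to invert it afterwards. The frame-major interleaving order, the per-codebook vocabulary offsets, and the function names `flatten_codes` and `unflatten_codes` are assumptions made for illustration; the summary above does not specify the exact layout DAC-SE1 uses.

```python
from typing import List


def flatten_codes(codes: List[List[int]], codebook_size: int) -> List[int]:
    """Interleave a [num_codebooks][num_frames] token grid into one stream.

    Assumed layout (not confirmed by the source): frame-major order, with
    codebook k shifted by k * codebook_size so that every codebook occupies
    its own range of the LM vocabulary.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0]) if num_codebooks else 0
    if any(len(row) != num_frames for row in codes):
        raise ValueError("all codebooks must cover the same number of frames")

    flat: List[int] = []
    for t in range(num_frames):
        for k in range(num_codebooks):
            token = codes[k][t]
            if not 0 <= token < codebook_size:
                raise ValueError(f"token {token} outside codebook range")
            flat.append(k * codebook_size + token)
    return flat


def unflatten_codes(
    flat: List[int], num_codebooks: int, codebook_size: int
) -> List[List[int]]:
    """Invert flatten_codes, recovering the [num_codebooks][num_frames] grid."""
    if len(flat) % num_codebooks != 0:
        raise ValueError("stream length must be a multiple of num_codebooks")

    num_frames = len(flat) // num_codebooks
    codes = [[0] * num_frames for _ in range(num_codebooks)]
    for i, token in enumerate(flat):
        t, k = divmod(i, num_codebooks)  # position -> (frame, codebook)
        codes[k][t] = token - k * codebook_size
    return codes


if __name__ == "__main__":
    # Toy grid: 3 codebooks, 2 frames, codebook size 10 (illustrative values).
    toy = [[1, 2], [3, 4], [5, 6]]
    stream = flatten_codes(toy, codebook_size=10)
    print(stream)
    print(unflatten_codes(stream, num_codebooks=3, codebook_size=10) == toy)
```

Under this layout the sequence length grows linearly with the number of codebooks, which is the main cost of flattening compared with predicting all codebooks of a frame in parallel.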
Abstract
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low-sampling-rate codecs, limiting them to narrow, task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language-model-based SE framework that leverages discrete high-resolution audio representations. DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods both on objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
