ComplexDec: A Domain-robust High-fidelity Neural Audio Codec with Complex Spectrum Modeling
Yi-Chiao Wu, Dejan Marković, Steven Krenn, Israel D. Gebru, Alexander Richard
TL;DR
This work tackles the fragility of neural audio codecs on out-of-domain audio, which it attributes to information loss from compression along the temporal and embedding dimensions. It introduces ComplexDec, a full-band 48 kHz neural codec that operates in the complex spectral domain with two RVQAE paths and a diffusion-based post-filter (SPF), at a bitrate of 24 kbps. Trained on about 30 hours of reading-style speech and evaluated on expressive out-of-domain speech, ComplexDec maintains comparable in-domain and out-of-domain performance while outperforming waveform-based baselines and several open-source models in perceptual quality. The results suggest that reducing temporal and dimensional compression via complex-spectrum modeling is a promising direction for domain-robust neural codecs, with potential for efficient integration into audio-generation systems.
Abstract
Neural audio codecs have been widely adopted in audio-generative tasks because their compact, discrete representations suit both large-language-model-style and regression-based generative models. However, most neural codecs struggle to model out-of-domain audio, and the resulting errors propagate to downstream generative tasks. In this paper, we first argue that information loss from codec compression degrades out-of-domain robustness. We then propose ComplexDec, a full-band 48 kHz codec with complex spectral input and output that eases this information loss while keeping the same 24 kbps bitrate as the AudioDec and ScoreDec baselines. Objective and subjective evaluations demonstrate the out-of-domain robustness of ComplexDec trained using only the 30-hour VCTK corpus.
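
To make "complex spectral input and output" concrete, the following is a minimal NumPy sketch of a complex-spectrum front end. It is not the ComplexDec implementation. It only illustrates that a waveform can be turned into stacked real and imaginary STFT channels and recovered almost exactly, so the representation itself discards essentially nothing before any quantization. The FFT size, hop size, and all function names (`stft`, `istft`, `to_channels`, `from_channels`) are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

# Hypothetical analysis settings; the paper's actual values are not given here.
N_FFT = 1024
HOP = 256


def stft(x, n_fft=N_FFT, hop=HOP):
    """Complex STFT with a periodic Hann window and centered (reflect-padded) frames."""
    window = np.hanning(n_fft + 1)[:-1]
    pad = n_fft // 2
    x = np.pad(x, (pad, pad), mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)


def istft(spec, n_fft=N_FFT, hop=HOP, length=None):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)
    out = out[n_fft // 2:]  # drop the leading reflect padding
    return out[:length]


def to_channels(spec):
    """Stack real and imaginary parts as two real-valued channels (codec-style input)."""
    return np.stack([spec.real, spec.imag], axis=0)


def from_channels(channels):
    """Rebuild the complex spectrum from the two real-valued channels."""
    return channels[0] + 1j * channels[1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(48000)  # one second of noise at 48 kHz
    channels = to_channels(stft(x))
    y = istft(from_channels(channels), length=len(x))
    print("channels shape:", channels.shape)
    print("max reconstruction error:", np.max(np.abs(x - y)))
```

In this sketch the only lossy step in a codec built on such a front end would be whatever encoder and quantizer sit between `to_channels` and `from_channels`. That matches the abstract's argument that robustness depends on how much information the compression stages discard.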
