Soft Disentanglement in Frequency Bands for Neural Audio Codecs
Benoit Ginies, Xiaoyu Bie, Olivier Fercoq, Gaël Richard
TL;DR
The paper tackles the interpretability challenge in neural audio codecs by proposing a soft-disentangled approach that uses spectral decomposition and a two-branch codec operating at $16~\mathrm{kHz}$ and $32~\mathrm{kHz}$. A cascade training regime enforces a soft separation of frequency content while allowing cross-band contributions to enhance reconstruction. Empirical results show modest gains in objective and perceptual quality over a strong DAC baseline, with additional evidence from SDR analyses that the high-frequency branch mainly encodes upper bands but can contribute to lower bands as well. A toy inpainting experiment demonstrates practical utility, suggesting the method's potential for sound transformation and restoration tasks beyond compression.
Abstract
In neural audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that depend heavily on data characteristics or on the specific task. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, then feeds the decomposed components to a multi-branch audio codec. Empirical evaluations show that our approach achieves better reconstruction and perceptual quality than a state-of-the-art baseline, while also offering potential advantages for inpainting tasks.
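To make the spectral decomposition step concrete, here is a minimal sketch of splitting a time-domain signal into a low band and a high band before each is passed to its own codec branch. It rests on assumptions not stated in the abstract: an ideal FFT mask stands in for whatever filter the authors actually use, the input is sampled at 32 kHz, and the cutoff is 8 kHz (the Nyquist frequency of a 16 kHz branch). The function name `split_bands` and its parameters are hypothetical.

```python
import numpy as np


def split_bands(x, sample_rate=32000, cutoff_hz=8000.0):
    """Split a mono signal into low and high bands with an ideal FFT mask.

    Assumption: a brick-wall mask is used purely for illustration; it is not
    claimed to be the decomposition used in the paper.
    """
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[-1]
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    low_mask = freqs < cutoff_hz
    # Zero out the complementary bins, then return to the time domain.
    low = np.fft.irfft(spectrum * low_mask, n=n)
    high = np.fft.irfft(spectrum * ~low_mask, n=n)
    return low, high


# Toy signal: one second containing a 440 Hz tone and a 12 kHz tone.
sr = 32000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)
low, high = split_bands(x, sample_rate=sr)
```

Because the two masks are complementary, `low + high` recovers the input up to floating-point error, so a hard split like this loses nothing by itself. The soft disentanglement described above goes further by letting the high-frequency branch also contribute to the lower band, which this sketch does not model.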
