Soft Disentanglement in Frequency Bands for Neural Audio Codecs

Benoit Ginies; Xiaoyu Bie; Olivier Fercoq; Gaël Richard

TL;DR

The paper tackles the interpretability challenge in neural audio codecs by proposing a soft-disentangled approach that uses spectral decomposition and a two-branch codec operating at $16~\mathrm{kHz}$ and $32~\mathrm{kHz}$. A cascade training regime enforces a soft separation of frequency content while allowing cross-band contributions to enhance reconstruction. Empirical results show modest gains in objective and perceptual quality over a strong DAC baseline, with additional evidence from SDR analyses that the high-frequency branch mainly encodes upper bands but can contribute to lower bands as well. A toy inpainting experiment demonstrates practical utility, suggesting the method's potential for sound transformation and restoration tasks beyond compression.

Abstract

In neural-based audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that depend heavily on data characteristics or on specific tasks. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, followed by a multi-branch audio codec that operates on the decomposed components. Empirical evaluations demonstrate that our approach achieves better reconstruction and perceptual quality than a state-of-the-art baseline, while also offering potential advantages for inpainting tasks.
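
To make the idea of spectral decomposition concrete, the following is a minimal sketch, not the authors' implementation. It splits a time-domain signal into a low and a high band with an ideal FFT mask and computes a simple signal-to-distortion ratio (SDR). The function names `split_bands` and `sdr_db`, the 8 kHz cutoff, and the test tones are all assumptions chosen for illustration; the paper's actual decomposition and SDR definition may differ.

```python
import numpy as np

def split_bands(x, sample_rate, cutoff_hz):
    """Split a 1-D signal into low and high bands with an ideal FFT mask.

    Hypothetical helper: the paper may use a different filter bank.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    low_mask = freqs < cutoff_hz
    # The two masks are complementary, so the bands sum back to x.
    low = np.fft.irfft(spectrum * low_mask, n=len(x))
    high = np.fft.irfft(spectrum * ~low_mask, n=len(x))
    return low, high

def sdr_db(reference, estimate, eps=1e-12):
    """Plain energy-ratio SDR in dB (an assumed, simplified definition)."""
    noise = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))

# Illustrative input: one second at 32 kHz with a 440 Hz and a 12 kHz tone.
sample_rate = 32000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)

# Assumed cutoff of 8 kHz, the Nyquist frequency of a 16 kHz signal.
low, high = split_bands(x, sample_rate, cutoff_hz=8000.0)

print("SDR of low band alone vs. full signal (dB):", sdr_db(x, low))
print("SDR of low + high vs. full signal (dB):", sdr_db(x, low + high))
```

In this sketch the two bands are strictly separated by construction. The soft disentanglement described above instead lets a branch contribute outside its nominal band when that improves reconstruction.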
