Soft Disentanglement in Frequency Bands for Neural Audio Codecs

Benoit Ginies, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

TL;DR

The paper tackles the interpretability challenge in neural audio codecs by proposing a soft-disentangled approach that uses spectral decomposition and a two-branch codec operating at $16~\mathrm{kHz}$ and $32~\mathrm{kHz}$. A cascade training regime enforces a soft separation of frequency content while allowing cross-band contributions to enhance reconstruction. Empirical results show modest gains in objective and perceptual quality over a strong DAC baseline, with additional evidence from SDR analyses that the high-frequency branch mainly encodes upper bands but can contribute to lower bands as well. A toy inpainting experiment demonstrates practical utility, suggesting the method's potential for sound transformation and restoration tasks beyond compression.

Abstract

In neural-based audio feature extraction, ensuring that representations capture disentangled information is crucial for model interpretability. However, existing disentanglement methods often rely on assumptions that are highly dependent on data characteristics or specific tasks. In this work, we introduce a generalizable approach for learning disentangled features within a neural architecture. Our method applies spectral decomposition to time-domain signals, followed by a multi-branch audio codec that operates on the decomposed components. Empirical evaluations demonstrate that our approach achieves better reconstruction and perceptual performance compared to a state-of-the-art baseline while also offering potential advantages for inpainting tasks.

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Proposed disentangled codec. The $16~\mathrm{kHz}$ branch reconstructs the $[0, 8]~\mathrm{kHz}$ signal. The $32~\mathrm{kHz}$ branch processes the residual between $S_{32\,\mathrm{kHz}}$ and $U(\hat{d}_{16\,\mathrm{kHz}})$. The $[0, 16]~\mathrm{kHz}$ output signal is obtained by summing the outputs of the two branches.
  • Figure 2: Spectrograms of $S_{32\,\mathrm{kHz}}$, $U(\hat{d}_{16\,\mathrm{kHz}})$ and $\hat{d}_{32\,\mathrm{kHz}}$. $U(\hat{d}_{16\,\mathrm{kHz}})$ encodes information only in the $[0, 8]~\mathrm{kHz}$ band. $\hat{d}_{32\,\mathrm{kHz}}$ has most of its energy in the $[8, 16]~\mathrm{kHz}$ band, although it also carries residual information in the lower band.
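
The signal flow described in the two captions can be illustrated with a small numerical sketch. It is not the authors' implementation. The neural codec branches are replaced by a hypothetical uniform quantizer (`toy_codec`), the resampling operators are idealized FFT-based versions, and every function name and constant is an assumption made for illustration. The sketch only shows how coding the residual in a second branch yields an upsampled low-band output with no content above 8 kHz, while the high-band output keeps some energy in the lower band.

```python
import numpy as np

FS_HIGH = 32_000   # sampling rate of the full-band input, in Hz
STEP = 0.05        # quantizer step of the toy codec (hypothetical value)


def downsample2(x):
    """Ideal 2x downsampling: keep only the lower half of the spectrum."""
    n = len(x)
    spec = np.fft.rfft(x)[: n // 4 + 1].copy()
    spec[-1] = 0.0  # discard the bin that lands on the new Nyquist frequency
    return np.fft.irfft(spec, n // 2) / 2.0


def upsample2(x):
    """Ideal 2x upsampling U(.): zero-pad the spectrum, adding no new content."""
    n = len(x)
    padded = np.zeros(n + 1, dtype=complex)
    padded[: n // 2 + 1] = np.fft.rfft(x)
    return np.fft.irfft(padded, 2 * n) * 2.0


def toy_codec(x):
    """Hypothetical stand-in for a neural codec branch: a uniform quantizer."""
    return np.round(x / STEP) * STEP


def split_energy(x):
    """Return (energy in bins up to fs/4, energy in bins above fs/4)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    k = len(x) // 4  # bin index of fs/4, i.e. 8 kHz for a 32 kHz signal
    return power[: k + 1].sum(), power[k + 1:].sum()


# Test signal: one tone in each band (1 kHz and 12 kHz), 1024 samples at 32 kHz.
t = np.arange(1024) / FS_HIGH
s32 = 0.5 * np.sin(2 * np.pi * 1_000 * t) + 0.3 * np.sin(2 * np.pi * 12_000 * t)

low_hat = upsample2(toy_codec(downsample2(s32)))  # plays the role of U(d_hat_16kHz)
high_hat = toy_codec(s32 - low_hat)               # plays the role of d_hat_32kHz
out = low_hat + high_hat                          # full-band reconstruction

print("low branch  (<= 8 kHz, > 8 kHz):", split_energy(low_hat))
print("high branch (<= 8 kHz, > 8 kHz):", split_energy(high_hat))
print("max abs reconstruction error:", np.max(np.abs(out - s32)))
```

In this toy setup the second branch codes whatever the first branch failed to reproduce, so it also absorbs the first branch's low-band quantization error. That is one simple way to picture the "soft" separation: the high branch is dominated by the upper band without being strictly confined to it.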