Learning Source Disentanglement in Neural Audio Codec

Xiaoyu Bie, Xubo Liu, Gaël Richard

TL;DR

This work tackles the lack of domain-aware latent representations in neural audio codecs by introducing SD-Codec, a source-disentangled neural audio codec that jointly learns resynthesis and source separation using domain-specific RVQs for speech, music, and SFX. It demonstrates that separating latent features into domain-specific codebooks yields interpretable representations while maintaining competitive reconstruction quality and enabling mixture reconstruction. The approach includes a shared-codebook variant, ablation analyses, and zero-shot evaluation on the DnR dataset, showing strong performance and insights into which RVQ layers encode domain information. The results point to improved explainability and potential for fine-grained control in generative audio systems, with practical bitrate feasibility around 6 kbps per track.

Abstract

Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks (sets of discrete representations). Experimental results indicate that SD-Codec maintains competitive resynthesis quality and, as supported by the separation results, successfully disentangles different sources in the latent space, thereby enhancing the interpretability of audio codecs and potentially providing finer control over the audio generation process.
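
To make the idea of assigning each domain to its own codebooks concrete, here is a minimal NumPy sketch of residual vector quantization (RVQ) with one codebook stack per domain. It is an illustration under stated assumptions, not the authors' implementation: the function name `rvq_encode`, the sizes `T`, `D`, `K`, `N_LAYERS`, the random codebooks, and the random stand-in for the encoder output are all hypothetical. It also assumes that every domain RVQ quantizes the same encoder latent and that the mixture latent is the sum of the per-domain quantized latents; the text above does not spell out either detail.

```python
import numpy as np


def rvq_encode(z, codebooks):
    """Residual VQ. z: (T, D) latent frames; codebooks: list of (K, D) arrays.

    Returns the quantized latent (T, D) and the token indices (n_layers, T).
    """
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every frame to every code: (T, K)
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen   # accumulate this layer's contribution
        residual -= chosen    # the next layer quantizes what is left over
        indices.append(idx)
    return quantized, np.stack(indices)


rng = np.random.default_rng(0)
T, D, K, N_LAYERS = 50, 8, 16, 4          # hypothetical sizes
DOMAINS = ("speech", "music", "sfx")

# One independent stack of codebooks per domain (random here, learned in practice).
codebooks = {
    d: [rng.normal(size=(K, D)) * 0.5 ** i for i in range(N_LAYERS)]
    for d in DOMAINS
}

# Random stand-in for the encoder output of a mixture (assumption).
z = rng.normal(size=(T, D))

quantized, tokens = {}, {}
for d in DOMAINS:
    quantized[d], tokens[d] = rvq_encode(z, codebooks[d])

# Assumed mixture latent: sum of the per-domain quantized latents.
mixture_latent = sum(quantized.values())
```

In this sketch, decoding `quantized["speech"]` alone would correspond to extracting the speech track, while decoding `mixture_latent` would correspond to resynthesizing the full mixture; the decoder itself is omitted.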

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Source-Disentangled Neural Audio Codec (SD-Codec)
  • Figure 2: SD-Codec with shared codebooks ($R=4$, $S=2$)
  • Figure 3: Single source audio resynthesis using different RVQ modules.