Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

TL;DR

This work tackles bandwidth extension by reconstructing high-frequency content from low-pass audio using a discretized latent space. It introduces HP-codec, a disentangled neural audio codec with harmonic–percussive components across two branches, and HP-codecX, a Transformer-based language model that predicts high-frequency tokens from low-frequency latents. Across MUSDB18 and out-of-domain datasets, HP-codecX achieves state-of-the-art performance on objective metrics and human listening tests, outperforming Apollo and AudioSR. The study highlights the value of aligning codec disentanglement with downstream generative modeling, enabling more accurate high-frequency reconstruction and suggesting practical benefits for bandwidth extension.

Abstract

Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.
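The abstract does not say how the Harmonic-Percussive decomposition is computed. As a rough illustration only, the sketch below uses median filtering of a magnitude spectrogram, a common technique for this kind of separation that is assumed here and not confirmed as the authors' method. The function names (`median_filter_1d`, `hp_masks`), the window width and the toy spectrogram are all hypothetical.

```python
from statistics import median


def median_filter_1d(values, width):
    """Median-filter a sequence; the window is truncated at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def hp_masks(spec, width=3, eps=1e-12):
    """Return soft harmonic and percussive masks for a magnitude spectrogram.

    spec[f][t] is the magnitude at frequency bin f and time frame t.
    Assumption: median-filter separation, not necessarily the paper's method.
    """
    n_freq, n_time = len(spec), len(spec[0])
    # Harmonic content is stable over time: smooth each frequency row along time.
    harm = [median_filter_1d(row, width) for row in spec]
    # Percussive content is broadband: smooth each time column along frequency.
    cols = [median_filter_1d([spec[f][t] for f in range(n_freq)], width)
            for t in range(n_time)]
    mask_h, mask_p = [], []
    for f in range(n_freq):
        row_h, row_p = [], []
        for t in range(n_time):
            h, p = harm[f][t], cols[t][f]
            total = h + p + eps  # eps avoids division by zero in silent bins
            row_h.append(h / total)
            row_p.append(p / total)
        mask_h.append(row_h)
        mask_p.append(row_p)
    return mask_h, mask_p


# Toy spectrogram: one sustained tone (row 2) and one click (column 2).
toy = [[1.0 if f == 2 or t == 2 else 0.0 for t in range(5)] for f in range(5)]
mask_h, mask_p = hp_masks(toy)
```

In this toy case the sustained row is assigned to the harmonic mask and the click column to the percussive mask. Bins that neither mask claims strongly would correspond to a residual component.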

Paper Structure

This paper contains 28 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: HP-codec, our spectrally informed disentangled codec. It is divided into two branches operating at different sampling rates: a 16 kHz branch and a 48 kHz branch. Each branch contains parallel residual vector quantizers (RVQs) which are composed of a harmonic section, a percussive section and a residual section.
  • Figure 2: HP-codecX, our bandwidth extension model. It connects the 16 kHz representation, extracted from the input, to the 48 kHz decoder, through a language model organized into three sub-models: a harmonic estimator, a percussive estimator and a residual estimator.
  • Figure 3: Out-of-domain objective reconstruction metrics. These metrics were computed for the Apollo (Apo), AudioSR (Aud) and HP-codecX (HPX) models. They were calculated at 44.1 kHz for the Apollo model, and at 48 kHz for the others.
  • Figure 4: Reconstruction metrics of HP-codec, varying the spectral composition of the input (Global, Harmonic, Percussive or Residual) and the sections of the RVQs used for reconstruction. These graphs illustrate the values of Table \ref{tab-A_codec}.
  • Figure 5: Objective reconstruction metrics, calculated on whole estimated signals (Global) and on the high-frequency bands of estimated signals (HF). These metrics were computed on out-of-domain test datasets: ENST-Drums, Medley-solos-DB, OrchideaSOL, Monophonic and Polyphonic. The Apollo (Apo) metrics were calculated at 44.1 kHz, while the AudioSR (Aud) and HP-codecX (HPX) metrics were calculated at 48 kHz. These graphs illustrate the values contained in the Global and HF rows of Table \ref{tab-language_model_A}.
  • ...and 2 more figures
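
The Figure 1 caption refers to RVQs without describing how they work. As a minimal sketch of generic residual vector quantization, each stage below picks the nearest codebook entry and passes the remaining error to the next stage. The codebooks, the input vector and the function names are hypothetical, and this excerpt does not specify how stages are grouped into harmonic, percussive and residual sections.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage setup: a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
tokens = rvq_encode([1.2, 0.8], [coarse, fine])
approx = rvq_decode(tokens, [coarse, fine])
```

The resulting token list is the kind of discrete representation a language model can be trained on, with one token per quantizer stage.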