Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach

Eloi Moliner, Filip Elvander, Vesa Välimäki

TL;DR

BABE addresses the ill-posed problem of blind audio bandwidth extension by introducing a zero-shot method that couples diffusion-based priors with a parametrized lowpass model. It jointly infers the unknown degradation and regenerates high-frequency content during diffusion, using a piecewise-linear log-frequency filter with $S$ breakpoints and warm-started sampling to stabilize convergence. The approach outperforms prior blind baselines on objective metrics and delivers strong subjective quality improvements on historical gramophone recordings, while approaching the performance of informed oracle methods on synthetic data. The work demonstrates robust out-of-domain generalization to other instruments and highlights practical considerations for historical restoration, including frame-wise processing and hyperparameter sensitivity.
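The piecewise-linear log-frequency filter mentioned above can be illustrated with a minimal sketch. The exact parametrization is not given in this summary, so the code assumes each of the $S$ breakpoints has a cutoff frequency in Hz and a slope in dB per octave, with a flat 0 dB response below the first cutoff. The function name `piecewise_lowpass_db` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def piecewise_lowpass_db(freqs_hz, cutoffs_hz, slopes_db_oct):
    """Magnitude response in dB of a piecewise-linear lowpass filter.

    Assumed parametrization (not confirmed by the source): the response is
    0 dB below the first cutoff, and after breakpoint s it falls linearly
    in log2-frequency with slope slopes_db_oct[s] until the next breakpoint.
    """
    freqs = np.asarray(freqs_hz, dtype=float)
    cutoffs = np.asarray(cutoffs_hz, dtype=float)
    slopes = np.asarray(slopes_db_oct, dtype=float)
    if cutoffs.shape != slopes.shape:
        raise ValueError("need one slope per breakpoint")
    if np.any(cutoffs <= 0) or np.any(np.diff(cutoffs) <= 0):
        raise ValueError("cutoffs must be positive and strictly increasing")

    # Upper edge of each segment: the next cutoff, or infinity for the last.
    uppers = np.append(cutoffs[1:], np.inf)
    gain_db = np.zeros_like(freqs)
    for fc, upper, slope in zip(cutoffs, uppers, slopes):
        # Octaves travelled inside this segment (zero below fc).
        clipped = np.clip(freqs, fc, upper)
        gain_db += slope * np.log2(clipped / fc)
    return gain_db

# Example with S = 3 breakpoints (values chosen only for illustration).
freqs = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
response_db = piecewise_lowpass_db(
    freqs,
    cutoffs_hz=[1000.0, 2000.0, 4000.0],
    slopes_db_oct=[-10.0, -20.0, -40.0],
)
linear_gain = 10.0 ** (response_db / 20.0)  # multiply a magnitude spectrum by this
```

Accumulating the slope contribution of each segment keeps the response continuous at the breakpoints, which is what makes the curve piecewise linear on a log-frequency axis.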

Abstract

Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/

Paper Structure

This paper contains 32 sections, 19 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Graphical representation of the inference process. (a) The input observations were produced by applying a lowpass filter (red dotted line) to (f) the Ground Truth (GT) reference signal. The proposed method, BABE, iteratively reconstructs the missing high-frequency spectra through a reverse diffusion process (b), (c), (e), while it blindly estimates the lowpass filter degradation (white line overlaid in (b), (c), and (e)). A sampling step is represented in closer detail in (d), where the denoising Deep Neural Network (DNN) is applied, the filter parameters $\phi_i$ are iteratively optimized, and the audio data $\mathbf{x}_i$ is updated using reconstruction guidance.
  • Figure 2: Parametric lowpass filter model used in the BABE method ($S=3$).
  • Figure 3: Frequency weighting function that the proposed BABE method applies with the purpose of accelerating and improving the filter inference.
  • Figure 4: Representation of the joint posterior sampling and filter inference, where a single-breakpoint ($S=1$) filter is optimized. The left column (b), (d), (f) shows the denoised estimates $\mathbf{\hat{x}}_0$ at different noise levels $\sigma$, which, together with the observations $\mathbf{y}$ (a), were used to compute the cost function $C_\text{filter}$. The right column (c), (e), (g) shows the evolution of the cost function $C_\text{filter}$ with respect to the two parameters (slope $A$ and cutoff frequency $f_\text{c}$), showcasing how the filter estimation becomes more accurate as the inference process proceeds. A high-frequency emphasis filter was used for better visualization of the spectrograms.
  • Figure 5: Diagram of the inference process for a real historical recording. The original recording is first denoised before being used as a guiding signal for the generation. Throughout the generation, BABE estimates the (unknown) lowpass degradation of the original recording, here depicted by a magenta line overlaid on the spectrograms.
  • ...and 5 more figures
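
The cost surface over slope $A$ and cutoff frequency $f_\text{c}$ described in the Figure 4 caption can be mimicked with a toy sketch. Everything below is an assumption made for illustration: the definition of $C_\text{filter}$ is not given in this summary, so a plain mean squared error between magnitude spectra stands in for it, a random spectrum stands in for the denoised estimate $\mathbf{\hat{x}}_0$, and an exhaustive grid search replaces the iterative optimization that the paper performs. The helper names `single_breakpoint_gain` and `filter_cost` are hypothetical.

```python
import numpy as np

def single_breakpoint_gain(freqs_hz, fc_hz, slope_db_oct):
    """Linear gain of an S = 1 lowpass: flat below fc, then slope in dB/octave."""
    octaves_above = np.log2(np.maximum(freqs_hz, fc_hz) / fc_hz)
    return 10.0 ** (slope_db_oct * octaves_above / 20.0)

def filter_cost(x_mag, y_mag, freqs_hz, fc_hz, slope_db_oct):
    """Stand-in cost (assumed, not the paper's C_filter): MSE between the
    filtered estimate and the observed magnitude spectrum."""
    predicted = single_breakpoint_gain(freqs_hz, fc_hz, slope_db_oct) * x_mag
    return float(np.mean((predicted - y_mag) ** 2))

rng = np.random.default_rng(0)
freqs = np.linspace(0.0, 11025.0, 513)

# Toy stand-in for the magnitude spectrum of a denoised estimate.
x_mag = rng.uniform(0.5, 1.5, size=freqs.shape)

# Observations: the same spectrum passed through a "true" unknown filter.
true_fc, true_slope = 2000.0, -30.0
y_mag = single_breakpoint_gain(freqs, true_fc, true_slope) * x_mag

# Evaluate the cost on a grid of candidate parameters.
fc_grid = np.linspace(500.0, 5000.0, 10)    # 500, 1000, ..., 5000 Hz
slope_grid = np.linspace(-60.0, -10.0, 6)   # -60, -50, ..., -10 dB/octave
cost = np.array([
    [filter_cost(x_mag, y_mag, freqs, fc, slope) for fc in fc_grid]
    for slope in slope_grid
])

# The grid minimum gives the estimated slope and cutoff.
row, col = np.unravel_index(np.argmin(cost), cost.shape)
best_slope, best_fc = slope_grid[row], fc_grid[col]
```

Because this toy uses the exact clean spectrum, the minimum sits on the true parameters from the start. In the setting the caption describes, the estimate $\mathbf{\hat{x}}_0$ is only approximate at high noise levels $\sigma$, which is consistent with the filter estimate becoming more accurate as inference proceeds.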