Diffusion-Based Audio Inpainting

Eloi Moliner, Vesa Välimäki

TL;DR

This work investigates diffusion-based audio inpainting for reconstructing missing segments. It addresses the weakness of traditional methods on longer gaps by using an unconditional diffusion generator that can be conditioned in a zero-shot manner. It introduces CQT-Diff+, an improved diffusion model that operates in an invertible constant-Q transform domain and uses pitch-equivariant structure and timewise self-attention to produce coherent reconstructions of gaps up to $300\,\mathrm{ms}$ long. In objective metrics (LSD, ODG, FAD) and a MUSHRA-style subjective study on MusicNet, the method performs on par with or better than strong baselines for short gaps and significantly outperforms them for mid-sized gaps, which supports its practical use for restoring disturbed audio. The main limitation is the reliance on classical-music training data. For very long gaps, conditioning on high-level structure or auxiliary signals could bring further gains, and future work points to conditional diffusion approaches and broader audio domains. The approach offers a scalable, high-quality option for repairing local dropouts in recordings, with potential applications in archival restoration and audio production.

Abstract

Audio inpainting aims to reconstruct missing segments in corrupted recordings. Most existing methods produce plausible reconstructions when the gaps are short, but struggle to reconstruct gaps longer than about 100 ms. This paper explores recent advancements in deep learning, particularly diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting and is able to regenerate gaps of any size. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps, up to 300 ms. The results of a formal listening test show that the proposed method performs comparably to the baselines for short gaps, such as 50 ms, while retaining good audio quality and outperforming the baselines for wider gaps of up to 300 ms. The method presented in this paper can be applied to restoring sound recordings that suffer from severe local disturbances or dropouts, which must be reconstructed.
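
To make the inpainting setup concrete, the following minimal sketch builds a binary mask for a single gap and applies it to a signal, so that the masked samples are the ones a model would have to regenerate. The 44.1 kHz sample rate, the helper name `make_gap_mask`, and the sine test signal are assumptions chosen for illustration; none of them is taken from the paper.

```python
import numpy as np

def make_gap_mask(num_samples, gap_start_s, gap_len_s, sample_rate=44100):
    """Return a binary mask: 1 where audio is observed, 0 inside the gap.

    Hypothetical helper for illustration; the sample rate is an assumption.
    """
    start = int(round(gap_start_s * sample_rate))
    length = int(round(gap_len_s * sample_rate))
    if start < 0 or length < 0 or start + length > num_samples:
        raise ValueError("gap must lie inside the signal")
    mask = np.ones(num_samples, dtype=np.float32)
    mask[start:start + length] = 0.0
    return mask

sample_rate = 44100  # assumed sample rate
t = np.arange(sample_rate) / sample_rate  # one second of audio
clean = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

# A 300 ms gap starting at 0.5 s, the longest gap length evaluated above.
mask = make_gap_mask(len(clean), gap_start_s=0.5, gap_len_s=0.3,
                     sample_rate=sample_rate)
observed = mask * clean  # samples inside the gap are zeroed out
gap_samples = int((mask == 0).sum())
```

At the assumed sample rate, a 300 ms gap corresponds to 13230 missing samples, which gives a sense of how much signal has to be generated rather than interpolated.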

Paper Structure (23 sections, 14 equations, 7 figures, 1 algorithm)

Figures (7)

  • Figure 1: Inference block diagram for audio inpainting, where all straight lines represent a feedforward signal flow in the time domain. The deep neural network is included in the denoiser block. The computation of the reconstruction gradient requires differentiating through the mask and the denoiser block by means of backpropagation, denoted as "backprop." above, requiring a backward pass through the deep neural network, illustrated here with a dotted line. The spectrograms are shown for illustrative purposes.
  • Figure 2: Main diagram of the CQT-U-Net deep neural network architecture. In the diagram, only three octaves of eight are shown for clarity. The sizes of the spectrograms are not proportional to the real signals.
  • Figure 3: Building blocks of the backbone U-Net architecture, cf. Fig. 2.
  • Figure 4: Timewise self-attention block used in Fig. 3.
  • Figure 5: Average objective metrics, including (a) log-spectral distance (LSD), (b) Objective Difference Grades (ODG), and (c) the Fréchet Audio Distance (FAD), computed for various gap lengths from 25 to 300 ms. Lower is better for LSD and FAD, whereas higher is better for ODG. The proposed method (CQT-Diff+) obtained competitive results against the baselines in the reference-based metrics LSD and ODG, while being superior in terms of FAD.
  • ...and 2 more figures
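
The reconstruction gradient described in the Figure 1 caption can be illustrated with a toy example. The sketch below replaces the deep neural network with a hypothetical linear denoiser `W @ x`, so that the backward pass through the mask and the denoiser can be written by hand instead of relying on automatic differentiation. The squared-error cost on the observed samples, all names, and the step size are assumptions made for illustration and should not be read as the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

# Hypothetical linear "denoiser" standing in for the deep neural network.
W = 0.9 * np.eye(n) + 0.01 * rng.standard_normal((n, n))

def denoise(x_noisy):
    return W @ x_noisy

def reconstruction_cost(x_noisy, observed, mask):
    # Squared error between the observation and the denoised estimate,
    # counted only where the mask is 1 (outside the gap).
    residual = mask * (observed - denoise(x_noisy))
    return float(np.sum(residual ** 2))

def reconstruction_gradient(x_noisy, observed, mask):
    # Hand-written backward pass through the mask and the linear denoiser:
    # d/dx ||m * (y - W x)||^2 = -2 W^T (m * m * (y - W x)).
    residual = mask * (observed - denoise(x_noisy))
    return -2.0 * W.T @ (mask * residual)

clean = np.sin(2 * np.pi * np.arange(n) / n)
mask = np.ones(n)
mask[6:10] = 0.0                      # a short gap in the toy signal
observed = mask * clean
x_noisy = clean + 0.5 * rng.standard_normal(n)

grad = reconstruction_gradient(x_noisy, observed, mask)
step = 0.05                           # assumed guidance step size
x_guided = x_noisy - step * grad      # nudge the state toward the observation

cost_before = reconstruction_cost(x_noisy, observed, mask)
cost_after = reconstruction_cost(x_guided, observed, mask)
```

With a real network in the denoiser block, the same gradient would be obtained by backpropagation, which is why the caption mentions a backward pass through the deep neural network.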