Token-based Audio Inpainting via Discrete Diffusion

Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani

TL;DR

This work tackles the challenge of restoring long missing audio segments by introducing Audio Inpainting via Discrete Diffusion (AIDD), which performs diffusion in a tokenized discrete space instead of on raw waveforms or spectrograms. It combines a pre-trained WavTokenizer with a Diffusion Transformer (DiT) and two novel training strategies, span-based masking and a derivative-based regularization loss, to produce semantically coherent restorations for gaps up to 750 ms. The method outperforms strong baselines on MusicNet and MAESTRO across medium and long gaps while offering improved efficiency, illustrating the benefits of token-space diffusion for musical audio restoration. These results open avenues for broader applications in music generation, with potential extensions to speech inpainting and to language-model-inspired token-diffusion frameworks.

Abstract

Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods perform poorly when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across a range of gap lengths, specifically gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Audio examples of our proposed method can be found at https://iftach21.github.io/.
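
The abstract does not give the form of the derivative-based regularization loss, so the following is only an illustrative sketch of the general idea: penalize mismatch between the frame-to-frame changes of a predicted sequence and those of a reference sequence. The function names, the use of a first-order finite difference, and the squared-error penalty are all assumptions made for illustration, not the paper's formulation.

```python
def temporal_diff(frames):
    """First-order finite difference along time for a list of feature vectors."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def derivative_loss(pred, target):
    """Mean squared error between the temporal differences of two sequences.

    Hypothetical form (an assumption): both inputs are lists of equal-length
    feature vectors, one vector per time step.
    """
    d_pred = temporal_diff(pred)
    d_target = temporal_diff(target)
    squared = [(p - t) ** 2
               for row_p, row_t in zip(d_pred, d_target)
               for p, t in zip(row_p, row_t)]
    return sum(squared) / len(squared)


target = [[0.0], [1.0], [2.0], [3.0]]
shifted = [[5.0], [6.0], [7.0], [8.0]]   # same dynamics, constant offset
jittery = [[0.0], [2.0], [1.0], [3.0]]   # same endpoints, erratic dynamics

print(derivative_loss(shifted, target))  # 0.0
print(derivative_loss(jittery, target))  # 2.0
```

A penalty of this kind ignores constant offsets and reacts only to differences in how the sequence evolves over time, which is one plausible reading of "enforces smooth temporal dynamics".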

Paper Structure

This paper contains 38 sections, 9 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Our method operates on audio signals with missing (silent) segments. During inference, the input waveform, which contains a silent gap, is processed by the WavTokenizer encoder, which converts the audio into a discrete sequence of tokens. Next, a DiT performs inpainting by iteratively predicting the masked tokens, resulting in a reconstructed token sequence. Finally, the reconstructed tokens are passed through the WavTokenizer decoder to synthesize the output audio waveform in the masked part. During training, token sequences are corrupted with span-based masking at randomly sampled timesteps, and the DiT is optimized to predict the concrete score using the DWDSE objective, complemented by the derivative-based loss.
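
The training-time corruption described in the caption can be sketched as follows. This is a minimal illustration of span-based masking with an absorbing mask state, not the authors' implementation: the `MASK_ID` value, the exponential span-length distribution, the `mean_span` parameter, and the linear mapping from timestep to mask ratio are all assumptions.

```python
import random

MASK_ID = -1  # hypothetical id for the absorbing mask state


def span_mask(tokens, mask_ratio, mean_span=4, rng=None):
    """Mask contiguous spans until round(mask_ratio * len(tokens)) tokens are masked.

    Span lengths are drawn from an exponential distribution with mean
    `mean_span` (an assumed choice) and clipped so the target is never exceeded.
    """
    rng = rng or random.Random()
    n = len(tokens)
    target = round(mask_ratio * n)
    out = list(tokens)
    masked = 0
    while masked < target:
        span = max(1, round(rng.expovariate(1.0 / mean_span)))
        span = min(span, target - masked)
        start = rng.randrange(0, n - span + 1)
        for i in range(start, start + span):
            if out[i] != MASK_ID:  # count only newly masked positions
                out[i] = MASK_ID
                masked += 1
    return out


rng = random.Random(0)
t = rng.random()  # randomly sampled timestep in [0, 1)
# Assumed linear schedule: the mask ratio equals the timestep.
corrupted = span_mask(list(range(40)), mask_ratio=t, rng=rng)
```

Masking whole spans instead of independent positions exposes the model to contiguous holes during training, which resembles the gaps it has to fill at inference time.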