Token-based Audio Inpainting via Discrete Diffusion
Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
TL;DR
This work tackles the restoration of long missing audio segments by introducing Audio Inpainting via Discrete Diffusion (AIDD), which performs diffusion in a discrete token space instead of on raw waveforms or spectrograms. It combines a pre-trained WavTokenizer with a Diffusion Transformer and two novel training strategies, span-based masking and a derivative-based regularization loss, to produce semantically coherent restorations for gaps up to 750 ms. The method outperforms strong baselines on MusicNet and MAESTRO across medium and long gaps while offering improved efficiency, illustrating the benefits of token-space diffusion for musical audio restoration. These results open avenues for broader applications in music generation, with potential extensions to speech inpainting and to language-model-inspired token-diffusion frameworks.
Abstract
Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods degrade when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training strategies: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Audio examples of our proposed method can be found at https://iftach21.github.io/.
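The two training strategies named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the names `span_mask`, `derivative_penalty`, and `MASK_ID`, the codebook size of 4096, the sequence length of 75 tokens, and the choice of first-order differences with a squared-error penalty.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the absorbing [MASK] state (outside the codebook range)


def span_mask(tokens, span_len, num_spans, rng):
    """Corrupt contiguous runs of tokens instead of independent positions.

    Returns the corrupted sequence and a boolean mask of corrupted positions.
    Spans may overlap, so fewer than span_len * num_spans tokens can be masked.
    """
    corrupted = tokens.copy()
    mask = np.zeros(len(tokens), dtype=bool)
    for _ in range(num_spans):
        start = rng.integers(0, len(tokens) - span_len + 1)
        mask[start:start + span_len] = True
    corrupted[mask] = MASK_ID
    return corrupted, mask


def derivative_penalty(pred, target):
    """Mean squared error between first-order temporal differences.

    `pred` and `target` are (time, features) arrays. Matching differences
    penalizes abrupt frame-to-frame changes that the target does not have.
    """
    d_pred = np.diff(pred, axis=0)
    d_target = np.diff(target, axis=0)
    return float(np.mean((d_pred - d_target) ** 2))


rng = np.random.default_rng(0)
tokens = rng.integers(0, 4096, size=75)  # assumed codebook size and length
corrupted, mask = span_mask(tokens, span_len=10, num_spans=2, rng=rng)
```

Masking whole spans during training exposes the model to the same kind of corruption it faces at inference time, where the gap to be inpainted is a single contiguous region rather than scattered tokens.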
