DITTO: Diffusion Inference-Time T-Optimization for Music Generation
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan
TL;DR
DITTO tackles the challenge of training-free, fine-grained control for pre-trained diffusion models in music generation by optimizing the initial latent x_T at inference using differentiable feature losses. It incorporates gradient checkpointing to manage memory and demonstrates control across inpainting, outpainting, looping, intensity, melody, and musical structure without model fine-tuning. Compared to training-based, guidance-based, and other optimization-based baselines, DITTO achieves state-of-the-art control with better efficiency and preserves audio quality and text relevance. The results reveal that the diffusion latent space encodes rich, controllable, low-frequency content, enabling flexible, rapid experimentation for music generation.
Abstract
We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose frame-work for controlling pre-trained text-to-music diffusion models at inference-time via optimizing initial noise latents. Our method can be used to optimize through any differentiable feature matching loss to achieve a target (stylized) output and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control - all without ever fine-tuning the underlying model. When we compare our approach against related training, guidance, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door for high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
