DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
TL;DR
Controllable diffusion-based music generation has been hampered by slow inference-time optimization. DITTO-2 uses diffusion distillation (consistency models, CM, or consistency trajectory models, CTM) to enable fast one-step sampling, runs inference-time optimization through that one-step sampler as a cheap surrogate task to estimate noise latents, and then applies a final multi-step decoding pass to those latents for high quality. This yields 10-20x speedups while improving control adherence and audio quality, and extends to text-adherence control by maximizing CLAP score, which turns an unconditional model into a text-controllable one. The approach enables faster-than-real-time, interactive music creation for applications such as inpainting, outpainting, intensity, melody, and musical structure control.
Abstract
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but also improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
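To make the optimize-then-decode idea concrete, here is a minimal toy sketch in NumPy. It is not the paper's implementation: the "distilled" one-step sampler, the multi-step decoder, and the control feature (mean output level as a stand-in for an intensity target) are all hypothetical placeholders chosen so the gradient can be written by hand. The distillation step (1) is assumed to have already happened. The sketch only illustrates the structure of steps (2) and (3): optimize the noise latent through a cheap one-step sampler, then reuse that latent in a more expensive multi-step pass.

```python
import numpy as np


def one_step_sample(z):
    """Hypothetical stand-in for a distilled one-step sampler (not the paper's model)."""
    return np.tanh(z)


def multi_step_sample(z, steps=4):
    """Hypothetical stand-in for multi-step decoding from the same noise latent.

    It iteratively refines toward an output that is close to, but slightly
    more detailed than, the one-step result.
    """
    refined = np.tanh(z) + 0.05 * np.sin(3.0 * z)
    x = np.zeros_like(z)
    for k in range(steps):
        # Move a fraction of the remaining distance; the last step lands on `refined`.
        x = x + (refined - x) / (steps - k)
    return x


def control_loss(x, target_level):
    """Toy control objective: squared error between mean output level and a target."""
    return float((x.mean() - target_level) ** 2)


def optimize_latent(z0, target_level, lr=10.0, n_iters=200):
    """Gradient descent on the noise latent through the one-step sampler only."""
    z = z0.copy()
    for _ in range(n_iters):
        x = one_step_sample(z)
        err = x.mean() - target_level
        # Hand-derived gradient of (mean(tanh(z)) - target)^2 with respect to z.
        grad = 2.0 * err * (1.0 - x ** 2) / z.size
        z -= lr * grad
    return z


def ditto2_toy(seed=0, n_frames=64, target_level=0.3):
    """Run the toy pipeline and return (initial, surrogate, final) control losses."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal(n_frames)
    initial = control_loss(multi_step_sample(z0), target_level)
    z_opt = optimize_latent(z0, target_level)  # surrogate optimization, 1-step sampler
    surrogate = control_loss(one_step_sample(z_opt), target_level)
    final = control_loss(multi_step_sample(z_opt), target_level)  # multi-step decoding
    return initial, surrogate, final


if __name__ == "__main__":
    initial, surrogate, final = ditto2_toy()
    print(f"initial={initial:.4f} surrogate={surrogate:.2e} final={final:.4f}")
```

In this toy setting, the latent found with the cheap sampler also lowers the control loss of the multi-step output, because both samplers map the same latent to similar results. In a real system, how well that transfer holds would depend on how closely the distilled and multi-step samplers agree.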
