DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan

TL;DR

Controllable diffusion-based music generation has been hampered by slow inference-time optimization. DITTO-2 distills a pre-trained diffusion model, via consistency model (CM) or consistency trajectory model (CTM) distillation, to enable fast 1-step sampling. It then optimizes the initial noise latent using 1-step sampling as a cheap surrogate task, and runs a final multi-step decoding pass from the optimized latent for high quality. This yields 10–20x speedups while improving control adherence and audio quality. It also extends to text-adherence control: maximizing CLAP score turns an unconditional model into an effective text-controllable system. The approach enables faster-than-real-time, interactive music creation across applications such as inpainting, outpainting, intensity, melody, and musical structure control.

Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
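
Steps (2) and (3) of the abstract can be illustrated with a small numerical sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the "sampler" is a toy linear ODE integrated with Euler steps (one step plays the role of the distilled 1-step model, four steps play the role of multi-step decoding), the control target is simply a set of desired values for the first `k` output coordinates, and the names `sample`, `control_loss`, and `optimize_noise` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8  # latent size; number of controlled output coordinates (toy values)

# Assumed toy dynamics: a fixed linear drift standing in for a learned model.
A = 0.4 * np.eye(d) + 0.05 * rng.standard_normal((d, d)) / np.sqrt(d)

def sample(x_T, n_steps):
    """Toy deterministic sampler: n Euler steps of dx/dt = A x from t=1 to t=0."""
    x = x_T
    for _ in range(n_steps):
        x = x - (1.0 / n_steps) * (A @ x)
    return x

def control_loss(x_0, y):
    """Toy control objective: squared error on the first k coordinates."""
    return float(np.sum((x_0[:k] - y) ** 2))

def optimize_noise(x_T, y, n_iters=200, lr=0.5):
    """Gradient descent on the initial noise, differentiating only the 1-step sampler."""
    # With one Euler step the output is (I - A) x_T, so the Jacobian of the
    # controlled coordinates with respect to x_T is the first k rows of (I - A).
    J = (np.eye(d) - A)[:k]
    x_T = x_T.copy()
    for _ in range(n_iters):
        residual = J @ x_T - y             # 1-step output minus target
        x_T -= lr * 2.0 * J.T @ residual   # gradient of ||residual||^2
    return x_T

y = np.ones(k)                     # hypothetical control target
x_init = rng.standard_normal(d)    # initial noise latent
x_opt = optimize_noise(x_init, y)  # surrogate optimization (1-step sampling)

surrogate_loss = control_loss(sample(x_opt, 1), y)         # loss seen by the optimizer
initial_decoded_loss = control_loss(sample(x_init, 4), y)  # multi-step decode, no control
final_decoded_loss = control_loss(sample(x_opt, 4), y)     # multi-step decode, optimized noise
print(surrogate_loss, initial_decoded_loss, final_decoded_loss)
```

In this toy setting the surrogate loss is driven to essentially zero, while the loss after 4-step decoding stays small but nonzero. That residual is a simple picture of the gap between the 1-step objective and multi-step decoding; the toy says nothing about audio quality.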

Paper Structure (20 sections, 4 equations, 5 figures, 3 tables, 1 algorithm)


Figures (5)

  • Figure 1: DITTO-2: Distilled Diffusion Inference-Time $\boldsymbol{T}$-Optimization. We speed up diffusion inference-time optimization-based music generation by 10-20x while improving control and audio quality. (Top) We use diffusion distillation to speed up performance (optimize with 1-step sampling). (Bottom) We then run multi-step sampling for final higher-quality generation (decoding).
  • Figure 2: (Top) Baseline DITTO runs optimization over a multi-step sampling process to find an initial noise latent to achieve a desired stylized output, incurring a large speed cost. (Bottom) When generating the final output (decoding), the same multi-step diffusion sampling process is used.
  • Figure 3: CTM Distillation for DITTO-2. We distill $\bm{G}_\phi$ by minimizing the distance between the jump from $\bm{x}_t$ to $\bm{x}_s$ and $\bm{x}_{t-1}$ to $\bm{x}_s$, where $\bm{x}_{t-1}$ is generated by sampling with the base model $\bm\epsilon_\theta$.
  • Figure 4: DITTO-2 inference speed vs. control MSE vs. audio quality (FAD, denoted by marker size; smaller is better). The dashed line denotes the cutoff for real-time performance, color denotes the ITO method, and subscripts denote the number of sampling steps during optimization / final decoding. Results are shown for intensity control; the same trends hold for CLAP score.
  • Figure 5: FAD, MSE, and CLAP results on intensity control for 1-step optimization, where orange lines denote baseline 20-step performance. Because of the domain gap between 1-step optimization and multi-step decoding, MSE increases with more decoding steps for both CM and CTM, though it still beats the baseline with $<4$ steps. CM is unable to beat baseline quality due to accumulated errors in multi-step sampling, while multi-step CTM achieves SOTA quality.
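
The distillation objective described in the Figure 3 caption can be written out as a rough sketch. The notation below is an assumption built only from that caption: $d(\cdot,\cdot)$ is an unspecified distance, $\mathrm{sg}(\phi)$ denotes a stop-gradient (or otherwise frozen) copy of the student parameters, and the three-argument form $\bm{G}_\phi(\bm{x}, t, s)$ for "jump from time $t$ to time $s$" is illustrative. The paper's actual loss may differ in where the comparison is made and how $t$ and $s$ are sampled.

```latex
\mathcal{L}(\phi)
  = \mathbb{E}_{t,\, s,\, \bm{x}_t}
    \Big[
      d\big(
        \bm{G}_\phi(\bm{x}_t,\, t,\, s),\;
        \bm{G}_{\mathrm{sg}(\phi)}(\bm{x}_{t-1},\, t-1,\, s)
      \big)
    \Big],
\qquad
\bm{x}_{t-1} = \text{one sampling step from } \bm{x}_t \text{ using } \bm\epsilon_\theta .
```

Read this way, the student $\bm{G}_\phi$ is trained so that jumping directly from $\bm{x}_t$ to $\bm{x}_s$ agrees with first taking one step of the base model and then jumping from $\bm{x}_{t-1}$ to $\bm{x}_s$.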