DITTO: Diffusion Inference-Time T-Optimization for Music Generation

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

TL;DR

DITTO tackles the challenge of training-free, fine-grained control for pre-trained diffusion models in music generation by optimizing the initial latent x_T at inference using differentiable feature losses. It incorporates gradient checkpointing to manage memory and demonstrates control across inpainting, outpainting, looping, intensity, melody, and musical structure without model fine-tuning. Compared to training-based, guidance-based, and other optimization-based baselines, DITTO achieves state-of-the-art control with better efficiency and preserves audio quality and text relevance. The results reveal that the diffusion latent space encodes rich, controllable, low-frequency content, enabling flexible, rapid experimentation for music generation.

Abstract

We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose framework for controlling pre-trained text-to-music diffusion models at inference time by optimizing initial noise latents. Our method can optimize through any differentiable feature matching loss to achieve a target (stylized) output, and it leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control, all without ever fine-tuning the underlying model. When we compare our approach against related training-, guidance-, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door for high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
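
The core idea in the abstract, optimizing the initial noise latent so that a feature of the generated output matches a target, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `denoise_step` is a hypothetical elementwise stand-in for a sampling step (not a pre-trained diffusion model), `feature` is a hypothetical mean-value feature, and the gradients are written by hand rather than obtained by automatic differentiation through a real sampler.

```python
import math
import random

def denoise_step(x, t):
    """Toy stand-in for one deterministic sampling step (NOT a real diffusion model)."""
    return [0.9 * v + 0.1 * math.tanh(v) for v in x]

def denoise_step_vjp(x, t, grad_out):
    """Vector-Jacobian product of denoise_step with respect to its input x."""
    return [g * (0.9 + 0.1 * (1.0 - math.tanh(v) ** 2)) for v, g in zip(x, grad_out)]

def feature(x0):
    """Hypothetical differentiable feature: the mean of the generated output."""
    return sum(x0) / len(x0)

def loss_and_grad(x_T, target, num_steps):
    # Forward: run the toy sampler from x_T to x_0, keeping each step's input.
    inputs = []
    x = x_T
    for t in range(num_steps, 0, -1):
        inputs.append(x)
        x = denoise_step(x, t)
    f = feature(x)
    loss = (f - target) ** 2
    # Backward: apply the chain rule from the loss back to the initial latent.
    grad = [2.0 * (f - target) / len(x)] * len(x)
    for t, x_in in zip(range(1, num_steps + 1), reversed(inputs)):
        grad = denoise_step_vjp(x_in, t, grad)
    return loss, grad

def optimize_initial_latent(target, dim=8, num_steps=5, iters=200, lr=0.5, seed=0):
    """Gradient descent on the initial latent only; the sampler itself is never changed."""
    rng = random.Random(seed)
    x_T = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(iters):
        loss, grad = loss_and_grad(x_T, target, num_steps)
        history.append(loss)
        x_T = [v - lr * g for v, g in zip(x_T, grad)]
    return x_T, history

if __name__ == "__main__":
    x_T, history = optimize_initial_latent(target=1.0)
    print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.2e}")
```

Only `x_T` is updated in this sketch, which mirrors the training-free framing: the sampler's behavior is fixed, and control comes entirely from the choice of starting latent.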

Paper Structure

This paper contains 37 sections, 12 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: We propose DITTO, or Diffusion Inference-Time $\boldsymbol{T}$-Optimization, a general-purpose framework to control pre-trained diffusion models at inference time. 1) We sample an initial noise latent $\boldsymbol{x}_T$; 2) run diffusion sampling to generate a music spectrogram $\boldsymbol{x}_0$; 3) extract features from the generated content; 4) input a target control signal; and 5) optimize the initial noise latent to fit any differentiable loss.
  • Figure 2: Different memory setups for backpropagation through sampling. Normally, all intermediate activations are stored in memory, which is intractable for modern diffusion models. In DITTO, gradient checkpointing keeps memory usage efficient at the cost of only 2x the number of model calls, preserving fast runtime.
  • Figure 3: Examples of DITTO's use for creative control, including intensity (left), melody (middle), and structure (right), with target controls and final features displayed below each spectrogram. All results are achieved without additional training or fine-tuning.
  • Figure 4: Failure cases of baseline outpainting methods. Baseline methods tend to create audible "seams" in the audio between overlap and non-overlap regions of the generated output, leading to unnatural jumps in semantic content. DITTO avoids this issue and provides seamless outpainting throughout the full generation.
  • Figure 5: Forward and backward passes for DOODL, both in its official implementation and alternatively using the EDICT invertible layers. The standard DOODL backprop doubles the number of model calls (relative to DITTO) due to the EDICT sampling, yet uses checkpointing to store function inputs for each timestep. When utilizing EDICT's invertibility, only the final outputs are stored in memory, yet the inversion process requires two more model passes per timestep during the backward pass.
  • ...and 5 more figures
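
The memory trade-off described in the Figure 2 caption can be sketched with a toy example. The names here (`ToyStep`, `grad_store_all`, `grad_checkpointed`) are hypothetical and operate on a scalar; they are not the paper's implementation. The sketch only shows the general checkpointing pattern: keep each sampling step's input, discard its internal activations, and recompute them during the backward pass, which doubles the number of step evaluations.

```python
import math

class ToyStep:
    """Hypothetical sampling step made of several internal layers (scalar toy)."""

    def __init__(self, num_layers=4):
        self.num_layers = num_layers
        self.calls = 0  # number of forward evaluations of the step

    def forward(self, x):
        """Return the step output and the per-layer inputs needed for backprop."""
        self.calls += 1
        acts = []
        for _ in range(self.num_layers):
            acts.append(x)
            x = 0.9 * x + 0.1 * math.tanh(x)
        return x, acts

    def backward(self, acts, grad):
        """Propagate grad through the layers using their saved inputs."""
        for a in reversed(acts):
            grad *= 0.9 + 0.1 * (1.0 - math.tanh(a) ** 2)
        return grad

def grad_store_all(step, x_T, num_steps):
    """Keep every internal activation of every step until the backward pass."""
    saved, x = [], x_T
    for _ in range(num_steps):
        x, acts = step.forward(x)
        saved.append(acts)
    peak_stored = sum(len(a) for a in saved)
    grad = 1.0  # computes d x_0 / d x_T
    for acts in reversed(saved):
        grad = step.backward(acts, grad)
    return grad, peak_stored

def grad_checkpointed(step, x_T, num_steps):
    """Keep only each step's input; recompute activations one step at a time."""
    inputs, x = [], x_T
    for _ in range(num_steps):
        inputs.append(x)
        x, _ = step.forward(x)  # internal activations are discarded here
    peak_stored = len(inputs) + step.num_layers
    grad = 1.0
    for x_in in reversed(inputs):
        _, acts = step.forward(x_in)  # second forward call for this step
        grad = step.backward(acts, grad)
    return grad, peak_stored

if __name__ == "__main__":
    full, ckpt = ToyStep(), ToyStep()
    g_full, mem_full = grad_store_all(full, 0.3, num_steps=20)
    g_ckpt, mem_ckpt = grad_checkpointed(ckpt, 0.3, num_steps=20)
    print(g_full, g_ckpt, full.calls, ckpt.calls, mem_full, mem_ckpt)
```

In this toy, the stored values grow as steps times layers when everything is kept, but only as steps plus layers with checkpointing, while the gradient is unchanged.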