Presto! Distilling Steps and Layers for Accelerating Music Generation

Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

TL;DR

This work tackles the slow inference of diffusion-based music generation by introducing Presto, a dual-distillation framework that reduces both the number of sampling steps and the cost of each step. It comprises Presto-S, a GAN-based distribution matching step distillation method for EDM-style diffusion models; Presto-L, a variance-preserving layer dropping method; and Presto-LS, which combines the two. Continuous-time conditioning and careful design of the loss noise distributions are central. The authors report state-of-the-art acceleration (10-18x, with latencies of about 230/435 ms for 32 s of mono/stereo audio) while preserving quality and diversity, outperforming multiple baselines, including in human judgments. These results highlight the potential of joint step-and-layer distillation for interactive, high-fidelity music generation and point to extensions across modalities.

Abstract

Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers by reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32 seconds of mono/stereo 44.1 kHz audio, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
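
To put the quoted latencies in perspective, the short sketch below converts them into a real-time factor, that is, seconds of audio produced per second of compute. The function name `real_time_factor` and the constants are illustrative and not taken from the paper; only the 230/435 ms and 32 s figures come from the abstract.

```python
# Figures quoted in the abstract: 32 s of 44.1 kHz audio,
# generated in 230 ms (mono) or 435 ms (stereo).
AUDIO_SECONDS = 32.0
LATENCY_MS = {"mono": 230.0, "stereo": 435.0}


def real_time_factor(audio_seconds: float, latency_ms: float) -> float:
    """Seconds of audio generated per second of wall-clock compute.

    Hypothetical helper for illustration; not part of the paper.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return audio_seconds / (latency_ms / 1000.0)


if __name__ == "__main__":
    for layout, ms in LATENCY_MS.items():
        rtf = real_time_factor(AUDIO_SECONDS, ms)
        print(f"{layout}: {rtf:.1f}x faster than real time")
```

Under these numbers, generation runs at roughly 139x real time for mono and roughly 74x for stereo. This is a separate quantity from the 10-18x speedup, which is measured relative to the base model.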

Paper Structure (38 sections, 9 equations, 15 figures, 4 tables, 3 algorithms)

Figures (15)

  • Figure 1: Presto-S. Our goal is to distill the initial "real" score model (grey) $\mu_{\boldsymbol{\theta}}$ into a few-step generator (light blue) $G_{\boldsymbol\phi}$ to minimize the KL divergence between the distribution of $G_{\boldsymbol\phi}$'s outputs and the real distribution. This requires that we train an auxiliary "fake" score model $\mu_{\boldsymbol{\psi}}$ (dark blue) that estimates the score of the generator's distribution at each gradient step. Formally: (1) real audio is corrupted with Gaussian noise sampled from the generator noise distribution $p_{\text{gen}}(\sigma^{\text{inf}})$ which is then (2) passed into the generator to get its output. Noise is then added to this generation according to three different noise distributions: (3) $p_{\text{DMD}}(\sigma^{\text{train}})$, which is (4) passed into both the real and fake score models to calculate the distribution matching gradient $\nabla_\phi\mathcal{L}_{\text{DMD}}$; (5) $p_{\text{DSM}}(\sigma^{\text{train/inf}})$, which is used to (6) train the fake score model on the generator's distribution with $\mathcal{L}_{\text{fake-DSM}}$; and (7) an adversarial distribution $p_{\text{GAN}}(\sigma^{\text{train}})$, which along with the real audio is (8) passed into a least-squares discriminator built on the fake score model's intermediate activations to calculate $\mathcal{L}_{\text{GAN}}$.
  • Figure 2: Training/Inference distributions for EDM models, in decibel SNR space.
  • Figure 3: Baseline layer dropping (left) vs. Presto-L (right). Standard layer dropping uses the noise level $\sigma$ to set the budget of layers to drop, starting from the back of the DiT blocks. Presto-L shifts this dropping by one to the second-to-last block and adds explicit budget conditioning.
  • Figure 4: Hidden activation variance vs. layer depth. Each line is a unique noise level.
  • Figure 5: Continuous generator inputs vs. discrete inputs. Continuous inputs show more consistent scaling with compute, while generally performing better in both quality and text relevance.
  • ...and 10 more figures
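
The difference between the two dropping schemes in the Figure 3 caption can be sketched as a choice of which block indices still run for a given drop budget. This is a minimal illustration under assumptions, not the authors' implementation: the function `kept_blocks` and its parameters are hypothetical, the budget is passed in directly (the caption says it is set from the noise level $\sigma$ but does not give the mapping), and the explicit budget conditioning that Presto-L adds is not modeled.

```python
from typing import List


def kept_blocks(num_blocks: int, budget: int, keep_last: bool) -> List[int]:
    """Return indices of DiT blocks that still run after dropping `budget` blocks.

    Hypothetical helper for illustration only.
    keep_last=False: drop the final `budget` blocks (baseline layer dropping).
    keep_last=True:  always run the final block and drop the `budget` blocks
                     immediately before it (the one-block shift in the caption).
    """
    max_budget = num_blocks - 1 if keep_last else num_blocks
    if not 0 <= budget <= max_budget:
        raise ValueError(f"budget must be in [0, {max_budget}]")

    if keep_last:
        # Dropped span ends just before the last block.
        dropped = set(range(num_blocks - 1 - budget, num_blocks - 1))
    else:
        # Dropped span runs to the very end of the stack.
        dropped = set(range(num_blocks - budget, num_blocks))

    return [i for i in range(num_blocks) if i not in dropped]


if __name__ == "__main__":
    print("baseline:", kept_blocks(8, 3, keep_last=False))
    print("shifted: ", kept_blocks(8, 3, keep_last=True))
```

With 8 blocks and a budget of 3, the baseline keeps blocks 0 through 4, while the shifted variant keeps blocks 0 through 3 plus the final block 7, so the last block always processes the hidden state.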