Presto! Distilling Steps and Layers for Accelerating Music Generation

Zachary Novack; Ge Zhu; Jonah Casebeer; Julian McAuley; Taylor Berg-Kirkpatrick; Nicholas J. Bryan

TL;DR

This work tackles the slow inference of diffusion-based music generation by introducing Presto!, a dual-distillation framework that reduces both the number of sampling steps and the cost per step. It comprises Presto-S, a GAN-based distribution matching distillation method for EDM-style diffusion models that reduces steps; Presto-L, a variance-preserving layer-dropping method that reduces per-step cost; and Presto-LS, which combines the two. Continuous-time conditioning and careful design of the loss noise distributions are central. The authors report state-of-the-art acceleration (10-18x, with latencies of about 230/435 ms for 32 s of mono/stereo 44.1 kHz audio) while preserving quality and diversity, outperforming multiple baselines, including in human judgments. These results highlight the potential of joint step-and-layer distillation for interactive, high-fidelity music generation and point to extensions across modalities.

Abstract

Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32-second mono/stereo 44.1 kHz audio, 15x faster than comparable SOTA), making it the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
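
To put the reported latencies in perspective, the worked arithmetic below converts them into a speed relative to real-time playback. It assumes that 230 ms corresponds to mono and 435 ms to stereo (following the order in which the abstract lists them) and that each latency covers generation of the full 32-second clip.

```latex
\[
\frac{32\ \text{s}}{0.230\ \text{s}} \approx 139\times \ \text{(mono)},
\qquad
\frac{32\ \text{s}}{0.435\ \text{s}} \approx 74\times \ \text{(stereo)}
\]
```

Under these assumptions, generation runs roughly two orders of magnitude faster than playback.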

Paper Structure (38 sections, 9 equations, 15 figures, 4 tables, 3 algorithms)

Introduction
Background & Related Work
Music Generation
Score-Based Diffusion Models
Diffusion Distillation
Presto!
EDM-Style Distribution Matching Distillation
Presto-S: Score-based Distribution Matching Distillation
Continuous-Time Generator Inputs
Perceptual Loss Weighting with Variable Noise Distributions
Audio-Aligned Discriminator Design
Presto-L: Variance and Budget-Aware Layer Dropping
Presto-LS: Layer-Step Distillation
Experiments
Setup
...and 23 more sections

Figures (15)

Figure 1: Presto-S. Our goal is to distill the initial "real" score model (grey) $\mu_{\boldsymbol{\theta}}$ into a few-step generator (light blue) $G_{\boldsymbol\phi}$ to minimize the KL divergence between the distribution of $G_{\boldsymbol\phi}$'s outputs and the real distribution. This requires that we train an auxiliary "fake" score model $\mu_{\boldsymbol{\psi}}$ (dark blue) that estimates the score of the generator's distribution at each gradient step. Formally: (1) real audio is corrupted with Gaussian noise sampled from the generator noise distribution $p_{\text{gen}}(\sigma^{\text{inf}})$ which is then (2) passed into the generator to get its output. Noise is then added to this generation according to three different noise distributions: (3) $p_{\text{DMD}}(\sigma^{\text{train}})$, which is (4) passed into both the real and fake score models to calculate the distribution matching gradient $\nabla_\phi\mathcal{L}_{\text{DMD}}$; (5) $p_{\text{DSM}}(\sigma^{\text{train/inf}})$, which is used to (6) train the fake score model on the generator's distribution with $\mathcal{L}_{\text{fake-DSM}}$; and (7) an adversarial distribution $p_{\text{GAN}}(\sigma^{\text{train}})$, which along with the real audio is (8) passed into a least-squares discriminator built on the fake score model's intermediate activations to calculate $\mathcal{L}_{\text{GAN}}$.
Figure 2: Training/Inference distributions for EDM models, in decibel SNR space.
Figure 3: Baseline layer dropping (left) vs. Presto-L (right). Standard layer dropping uses the noise level $\sigma$ to set the budget of layers to drop, starting from the back of the DiT blocks. Presto-L shifts this dropping by one to the second-to-last block and adds explicit budget conditioning.
Figure 4: Hidden activation variance vs. layer depth. Each line is a unique noise level.
Figure 5: Continuous generator inputs vs. discrete inputs. Continuous inputs show more consistent scaling with compute, while generally performing better in both quality and text relevance.
...and 10 more figures
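
The data flow in the Figure 1 caption (steps 1 to 4 and 7 to 8) and the layer-selection rule in the Figure 3 caption can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: closed-form Gaussian denoisers stand in for the trained score models and the generator, the noise levels are hand-picked rather than sampled from $p_{\text{gen}}$ or $p_{\text{DMD}}$, the distribution matching direction is left unweighted, and the helper names (`add_noise`, `gaussian_denoiser`, `dmd_direction`, `lsgan_losses`, `kept_blocks`) are hypothetical. The `keep_last=True` branch is one reading of "shifts this dropping by one to the second-to-last block"; budget conditioning is not modeled.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """EDM-style corruption: x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

def gaussian_denoiser(data_std):
    """Posterior-mean denoiser for data ~ N(0, data_std^2 I).
    Toy stand-in for a trained score model (assumption of this sketch)."""
    def denoise(x_noisy, sigma):
        return x_noisy * data_std**2 / (data_std**2 + sigma**2)
    return denoise

def dmd_direction(x_noisy, sigma, real_denoiser, fake_denoiser):
    """Unweighted distribution matching direction w.r.t. the generator
    output: fake-model estimate minus real-model estimate."""
    return fake_denoiser(x_noisy, sigma) - real_denoiser(x_noisy, sigma)

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives: the discriminator targets 1 for real
    and 0 for fake; the generator pushes fake scores toward 1."""
    d_loss = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    g_loss = np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

def kept_blocks(num_blocks, num_dropped, keep_last=False):
    """Indices of transformer blocks that still run.
    keep_last=False drops blocks from the back (baseline reading);
    keep_last=True always runs the final block and drops the blocks
    just before it (one reading of the Presto-L shift)."""
    if keep_last:
        return list(range(num_blocks - 1 - num_dropped)) + [num_blocks - 1]
    return list(range(num_blocks - num_dropped))

rng = np.random.default_rng(0)
real_audio = rng.standard_normal((4, 16))   # toy "real" batch with std 1
mu_real = gaussian_denoiser(1.0)            # stand-in for the real score model
mu_fake = gaussian_denoiser(0.5)            # stand-in for the fake score model
generator = gaussian_denoiser(0.5)          # toy generator, assumed too narrow

sigma_gen, sigma_dmd = 2.0, 1.0             # hand-picked noise levels
gen_input = add_noise(real_audio, sigma_gen, rng)            # step (1)
gen_output = generator(gen_input, sigma_gen)                 # step (2)
noisy_gen = add_noise(gen_output, sigma_dmd, rng)            # step (3)
direction = dmd_direction(noisy_gen, sigma_dmd, mu_real, mu_fake)  # step (4)

# Steps (7)-(8) with placeholder discriminator scores instead of a network.
d_loss, g_loss = lsgan_losses(np.ones(4), np.zeros(4))

baseline = kept_blocks(8, 3)                    # drops blocks 5, 6, 7
shifted = kept_blocks(8, 3, keep_last=True)     # drops blocks 4, 5, 6
```

In this toy setup the direction vanishes when the fake and real denoisers agree, which mirrors the intuition that the generator stops receiving a distribution matching signal once its output distribution matches the real one.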
