Low-Resource Guidance for Controllable Latent Audio Diffusion

Zachary Novack; Zack Zukowski; CJ Carr; Julian Parker; Zach Evans; Josiah Taylor; Taylor Berg-Kirkpatrick; Julian McAuley; Jordi Pons

TL;DR

This work introduces a guidance-based approach, built on selective TFG and Latent-Control Heads (LatCHs), for controlling latent audio diffusion models. It balances control precision and audio fidelity at far lower computational cost than standard end-to-end guidance.

Abstract

Generative audio requires fine-grained controllable outputs, yet most existing methods rely either on retraining the model for specific controls or on inference-time controls (e.g., guidance) that can be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach built on selective TFG and Latent-Control Heads (LatCHs) that enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, which avoids the expensive decoder step, and require minimal training resources (7M parameters and ≈ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.

Paper Structure (15 sections, 4 equations, 1 figure, 1 table)

This paper contains 15 sections, 4 equations, 1 figure, 1 table.

Introduction
Background and Our Method
Latent Audio Diffusion Background
TFG Background
Selective TFG (our proposal)
Latent-Control Heads (LatCHs, our proposal)
Experiments
Baselines
Datasets
Selective TFG, DDIM, and CFG hyperparameters
Training LatCHs
Quantitative Evaluation
Qualitative evaluation
Results and discussion
Conclusions

Figures (1)

Figure 1: Left. End-to-end guidance can be slow and VRAM intensive as it requires backpropagating through the VAE decoder. Center. LatCH is compute efficient as it directly predicts control features from the latent space. Right. Selective TFG is also compute efficient as it allows applying TFG guidance only on selected sampling steps.
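
The two ideas in Figure 1 can be illustrated with a toy sketch. This is not the paper's implementation: the linear head `W`, the quadratic control loss, the placeholder `denoise_step`, and the choice of `guided_steps` are all assumptions made for illustration. It only shows the structure, in which the control loss is evaluated on the latent itself (no decoder in the gradient path) and the guidance update runs only on a chosen subset of sampling steps.

```python
import numpy as np

def guidance_grad(z, W, target):
    """Analytic gradient of 0.5 * ||W z - target||^2 with respect to z.

    W stands in for a (hypothetical) latent control head that maps a
    latent z directly to control features, so no decoder is involved.
    """
    return W.T @ (W @ z - target)

def selective_guided_sampling(z, W, target, num_steps=50, guided_steps=None,
                              step_size=0.5, denoise_step=None):
    """Toy sampling loop with guidance applied only on selected steps.

    denoise_step is a placeholder for the real diffusion update (assumption).
    Returns the final latent and the number of guidance evaluations used.
    """
    if guided_steps is None:
        guided_steps = set(range(num_steps))  # guide on every step
    if denoise_step is None:
        denoise_step = lambda z, t: 0.99 * z  # stand-in, not a real sampler
    n_guided = 0
    for t in range(num_steps):
        z = denoise_step(z, t)
        if t in guided_steps:
            # Guidance update computed purely in latent space.
            z = z - step_size * guidance_grad(z, W, target)
            n_guided += 1
    return z, n_guided

if __name__ == "__main__":
    z0 = np.zeros(16)          # toy latent
    W = np.eye(4, 16)          # toy head: reads the first 4 latent dimensions
    target = np.ones(4)        # toy control target
    z, n = selective_guided_sampling(z0, W, target, num_steps=10,
                                     guided_steps={0, 2, 4, 6, 8})
    print("guidance evaluations:", n)
    print("control error:", np.linalg.norm(W @ z - target))
```

In this sketch the per-step guidance cost depends only on the head, and the total guidance cost scales with the size of `guided_steps` rather than with the number of sampling steps.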
