
TADA! Tuning Audio Diffusion Models through Activation Steering

Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja

TL;DR

This work shows that high-level musical attributes in text-to-audio diffusion systems are encoded within a small, shared subset of cross-attention layers, which act as a semantic bottleneck. After localizing these functional layers with Activation Patching, the authors apply targeted steering using Contrastive Activation Addition ($v_c^{\text{CAA}}$) and Sparse Autoencoders ($v_c^{\text{SAE}}$), achieving precise control over attributes such as tempo, mood, vocal gender, and instrument presence while preserving audio fidelity. The approach is validated across multiple architectures: steering only the bottleneck layers markedly outperforms global steering and maintains high audio quality, with SAEs offering interpretable, fine-grained control. Overall, layer-level localization combined with CAA and SAEs provides a robust, scalable path to fine-grained, high-fidelity audio editing and customization beyond what text prompts alone can achieve.

Abstract

Audio diffusion models can synthesize high-fidelity music from text, yet their internal mechanisms for representing high-level concepts remain poorly understood. In this work, we use activation patching to demonstrate that distinct semantic musical concepts, such as the presence of specific instruments, vocals, or genre characteristics, are controlled by a small, shared subset of attention layers in state-of-the-art audio diffusion architectures. Next, we demonstrate that applying Contrastive Activation Addition and Sparse Autoencoders in these layers enables more precise control over the generated audio, indicating a direct benefit of the specialization phenomenon. By steering activations of the identified layers, we can alter specific musical elements with high precision, such as modulating tempo or changing a track's mood.
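As a rough illustration of the steering idea, the sketch below computes a steering vector in the usual Contrastive Activation Addition style: the mean activation over examples that contain a concept minus the mean over examples that do not, added back with a scale factor only at chosen layers. This is a minimal sketch under assumptions, not the authors' implementation: the exact definition of $v_c^{\text{CAA}}$ is not given in this summary, and the names `caa_vector`, `steer`, `steer_layers`, and `alpha` are hypothetical. Activations are plain Python lists standing in for real layer outputs.

```python
from typing import Dict, List, Sequence, Set

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def caa_vector(pos_acts: Sequence[Sequence[float]],
               neg_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed CAA form: mean(with concept) - mean(without concept)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation: Sequence[float], v: Sequence[float], alpha: float) -> Vector:
    """Shift one activation along the steering vector by strength alpha."""
    return [a + alpha * s for a, s in zip(activation, v)]


def steer_layers(acts: Dict[int, Vector], v: Dict[int, Vector],
                 functional_layers: Set[int], alpha: float) -> Dict[int, Vector]:
    """Steer only the layers identified as functional; leave the rest untouched."""
    return {
        layer: steer(a, v[layer], alpha) if layer in functional_layers else list(a)
        for layer, a in acts.items()
    }


# Toy data: two activations collected with the concept, two without.
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 2.0]]
v_c = caa_vector(pos, neg)
steered = steer([1.0, 1.0, 1.0], v_c, alpha=0.5)

# Localized steering: only layer 1 is treated as functional here.
acts = {0: [1.0, 1.0, 1.0], 1: [1.0, 1.0, 1.0]}
localized = steer_layers(acts, {0: v_c, 1: v_c}, functional_layers={1}, alpha=0.5)
```

In a real model the activations would be cached from the cross-attention layers during generation, and `alpha` would control how strongly the concept is pushed.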

Paper Structure (21 sections, 9 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: We study localized steering in Audio Diffusion Models. By localizing functional layers, we enable precise steering of generations with Contrastive Activation Addition and Sparse Autoencoders.
  • Figure 2: Layer localization via Activation Patching. For a given music concept $c$ (e.g., 'male voice'), we perform (a) a target run with prompt $P_c$ and cache the cross-attention keys and values. In (b), the source run, we generate with prompt $P_{\tilde{c}}$, which either represents a counterfactual concept (e.g., 'female voice') or does not contain $c$. We patch layer $l$ by substituting its cross-attention key (K) and value (V) matrices with those cached from the $P_c$ run, while all other layers still receive $P_{\tilde{c}}$. If patching a layer produces audio containing concept $c$ (d), we identify it as a functional layer. Otherwise (c), the layer does not control the concept.
  • Figure 3: Functional cross-attention layers in AudioLDM2 (Liu et al., 2024), Stable Audio Open (Evans et al., 2024), and ACE-Step (Gong et al., 2025) models. We demonstrate that singular layers control different musical concepts, including vocal gender, tempo, mood, instruments, and genres across diverse audio diffusion architectures.
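The patching procedure described in the Figure 2 caption can be sketched as a loop over layers: swap one layer's cached key/value pair from the target run into the source run, generate, and check whether the concept appears. The code below is a toy sketch under stated assumptions, not the paper's pipeline: `toy_generate`, `toy_concept_score`, `LAYER_GAIN`, and `threshold` are hypothetical stand-ins for a real diffusion model, a real concept detector, and a real decision rule, and each K/V "matrix" is reduced to a short list of floats.

```python
from typing import Callable, List, Tuple

# Stand-in for one layer's cross-attention (key, value) matrices.
KV = Tuple[List[float], List[float]]


def patch_layer(source_kv: List[KV], target_kv: List[KV], layer: int) -> List[KV]:
    """Return the source run's K/V list with one layer replaced by the target run's."""
    patched = list(source_kv)
    patched[layer] = target_kv[layer]
    return patched


def find_functional_layers(source_kv: List[KV], target_kv: List[KV],
                           generate: Callable[[List[KV]], float],
                           concept_score: Callable[[float], float],
                           threshold: float) -> List[int]:
    """Patch each layer in turn; keep layers whose patched output shows the concept."""
    functional = []
    for layer in range(len(source_kv)):
        audio = generate(patch_layer(source_kv, target_kv, layer))
        if concept_score(audio) >= threshold:
            functional.append(layer)
    return functional


# Hypothetical toy model: only layer 2 has a strong effect on the output.
LAYER_GAIN = [0.0, 0.1, 1.0, 0.0]


def toy_generate(kv_per_layer: List[KV]) -> float:
    """Toy 'audio': a gain-weighted sum of the first value entry of each layer."""
    return sum(gain * v[0] for gain, (_k, v) in zip(LAYER_GAIN, kv_per_layer))


def toy_concept_score(audio: float) -> float:
    """Toy detector: the 'audio' scalar is itself the concept score."""
    return audio


# Source run (prompt without the concept) and target run (prompt with the concept).
source_cache: List[KV] = [([0.0], [0.0]) for _ in LAYER_GAIN]
target_cache: List[KV] = [([1.0], [1.0]) for _ in LAYER_GAIN]

functional = find_functional_layers(
    source_cache, target_cache, toy_generate, toy_concept_score, threshold=0.5
)
```

With this toy setup only the high-gain layer is flagged as functional, which mirrors the caption's distinction between outcomes (c) and (d).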