Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow

TL;DR

This paper tackles the interpretability gap in audio generation by linking latent representations to human-understandable acoustic concepts through sparse autoencoders. It trains SAEs on latent spaces from both continuous and discrete audio encoders and learns linear mappings (linear probes) to discretized pitch, amplitude, and timbre, enabling controllable manipulation via additive control vectors. The approach yields varying degrees of linear decodability across acoustic properties, with pitch being the most linearly separable, and demonstrates how generation dynamics unfold in models like DiffRhythm, showing a coarse-to-fine emergence of acoustic structure. The framework provides a generalizable path for interpretable analysis of audio latent spaces and offers practical means to steer AI music generation, with potential extensions to other modalities like visuals.

Abstract

While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio's dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. Although our experiments are limited to the audio modality, the framework can be extended to interpretable analysis of generative models that operate in visual latent spaces.
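
As a rough illustration of the pipeline the abstract describes (latents, then sparse features, then a linear mapping to discretized acoustic classes), here is a minimal NumPy sketch. Everything in it is an assumption made for illustration: the latents are synthetic, the SAE weights are random rather than trained, sparsity is enforced with a top-k rule, and the probe is fit by least squares on one-hot targets. The names `topk_sae_encode`, `fit_linear_probe`, and `probe_predict` are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for audio autoencoder latents (assumption: 8-dim frames).
n_frames, latent_dim, n_features, k, n_classes = 300, 8, 32, 8, 3
labels = rng.integers(0, n_classes, size=n_frames)  # e.g. discretized pitch bins
class_means = 2.0 * rng.normal(size=(n_classes, latent_dim))
latents = class_means[labels] + 0.05 * rng.normal(size=(n_frames, latent_dim))

# Untrained encoder weights; a real SAE would learn these with a
# reconstruction objective plus a sparsity constraint.
W_enc = rng.normal(size=(latent_dim, n_features)) / np.sqrt(latent_dim)
b_enc = np.zeros(n_features)


def topk_sae_encode(x, W, b, k):
    """Return sparse features with at most k non-zero activations per frame."""
    pre = np.maximum(x @ W + b, 0.0)                  # ReLU activations
    drop = np.argpartition(pre, -k, axis=1)[:, :-k]   # indices outside the top k
    sparse = pre.copy()
    np.put_along_axis(sparse, drop, 0.0, axis=1)      # zero everything else
    return sparse


def fit_linear_probe(Z, y, n_classes):
    """Fit a linear map (with bias) from features to one-hot class targets."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])
    Y = np.eye(n_classes)[y]
    W, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    return W


def probe_predict(Z, W):
    """Predict the class with the largest linear score."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])
    return (Z1 @ W).argmax(axis=1)


features = topk_sae_encode(latents, W_enc, b_enc, k)
W_probe = fit_linear_probe(features, labels, n_classes)
train_acc = (probe_predict(features, W_probe) == labels).mean()
print(f"max active features per frame: {(features != 0).sum(axis=1).max()}")
print(f"probe training accuracy: {train_acc:.2f}")
```

On real data the probe would be evaluated on held-out frames, and its accuracy per property (pitch, amplitude, timbre) is what indicates how linearly decodable that property is from the sparse features.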

Paper Structure

This paper contains 8 sections, 4 equations, 4 figures.

Figures (4)

  • Figure 1: Framework for interpreting and controlling audio generative models through sparse features learned on their generation space. Sparse autoencoders extract interpretable features from audio latents, which are then linearly mapped to acoustic concepts. Control vectors extracted from these linear mappings can then be used to transform audio.
  • Figure 2: Controlled audio manipulation via control vectors. When $\alpha$ increases, isolated changes in pitch (imminent C5), amplitude (decreasing loudness), and timbre (brightening via high-frequency emphasis) can be observed. Corresponding audio samples can be found here: https://anonymous.4open.science/r/audio_samples-A301/
  • Figure 3: Linear probe accuracy for acoustic property classification across different sparsity levels. Left: Stable Audio Open/DiffRhythm VAE, Middle: WavTokenizer, Right: EnCodec.
  • Figure 4: Probes Variation in Generation Progress
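
The additive control mentioned in the Figure 1 and Figure 2 captions can be sketched as follows. This excerpt does not specify how the control vectors are derived from the linear mappings, so the construction here is only one plausible assumption: take the probe weight column for a target class, center it against the other classes, and push it through a linear SAE decoder to get a latent-space direction that is added with strength $\alpha$. The weights are random placeholders, and `control_vector` and `steer` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, latent_dim, n_classes = 32, 8, 3

# Hypothetical stand-ins for a trained probe and a trained (linear) SAE decoder.
W_probe = rng.normal(size=(n_features, n_classes))
W_dec = rng.normal(size=(n_features, latent_dim)) / np.sqrt(n_features)


def control_vector(W_probe, W_dec, target):
    """Unit-norm latent direction that favors `target` under the linear probe."""
    # Feature-space direction: target class weights minus the mean over classes.
    direction = W_probe[:, target] - W_probe.mean(axis=1)
    # The decoder is linear, so a feature-space shift maps to a latent-space shift.
    v = direction @ W_dec
    return v / np.linalg.norm(v)


def steer(latents, v, alpha):
    """Add the control vector to every latent frame, scaled by alpha."""
    return latents + alpha * v


latents = rng.normal(size=(5, latent_dim))
v = control_vector(W_probe, W_dec, target=2)
steered = steer(latents, v, alpha=1.5)

# Each frame moves by exactly alpha along the control direction.
print(np.round((steered - latents) @ v, 3))
```

Setting `alpha` to 0 leaves the latents unchanged, and larger values move every frame further along the same direction, which matches the role $\alpha$ plays in the Figure 2 caption.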