Discovering and Steering Interpretable Concepts in Large Generative Music Models

Nikhil Singh, Manuel Cherep, Pattie Maes

TL;DR

This work examines how large transformer-based music generators internally represent musical structure. It proposes a scalable pipeline that discovers interpretable concepts by applying sparse autoencoders to the models' residual streams. The pipeline combines automated labeling with multimodal and classifier signals (including CLAP alignment), validates concepts through human studies, and demonstrates practical steering by injecting concept vectors during generation. The approach recovers known musical categories and reveals emergent regularities that lack established theory, and it analyzes how interpretability scales with layer depth and model size. The results offer a transparent, controllable framework for understanding and guiding generative music systems, with implications for theory-inspired analysis and creative collaboration.

Abstract

The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of such content's structure through statistical learning alone. This offers a potentially new lens on theories of human-generated media. When internal representations align with traditional constructs (e.g. chord progressions in music), they show how such categories can emerge from statistical regularities; when they diverge, they expose limits of existing frameworks and patterns we may have overlooked but that nonetheless carry explanatory power. In this paper, focusing on music generators, we introduce a method for discovering interpretable concepts using sparse autoencoders (SAEs), extracting interpretable features from the residual stream of a transformer model. We make this approach scalable and evaluable using automated labeling and validation pipelines. Our results reveal both familiar musical concepts and coherent but uncodified patterns lacking clear counterparts in theory or language. As an extension, we show such concepts can be used to steer model generations. Beyond improving model transparency, our work provides an empirical tool for uncovering organizing principles that have eluded traditional methods of analysis and synthesis.
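
As a rough illustration of the core idea, the following is a minimal sketch of a sparse autoencoder applied to a single residual-stream activation vector. It assumes the common ReLU encoder with an L1 sparsity penalty; the abstract does not specify the authors' exact SAE variant, so the class name `SparseAutoencoder`, the dimensions, the initialization, and the `l1_coeff` value are all hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE (assumed form): h = ReLU(W_enc (x - b_dec) + b_enc), x_hat = W_dec h + b_dec."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model -> d_features (usually d_features > d_model).
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        # Decoder maps d_features -> d_model; each column is one feature direction.
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Return non-negative feature activations for one activation vector."""
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, h):
        """Reconstruct the activation vector from feature activations."""
        return [r + b for r, b in zip(matvec(self.w_dec, h), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 penalty on the features."""
        h = self.encode(x)
        x_hat = self.decode(h)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(abs(v) for v in h)


# Hypothetical usage: a 4-dimensional "residual" vector and 8 candidate features.
sae = SparseAutoencoder(d_model=4, d_features=8)
x = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(x)
reconstruction = sae.decode(features)
```

In practice the weights would be trained by minimizing this loss over many activation vectors, after which individual features can be inspected through the inputs that activate them most strongly.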

Paper Structure

This paper contains 39 sections, 2 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Multi-stage pipeline for discovering and steering interpretable concepts in generative music models. (1) Music from a large corpus is passed through a pre-trained generator to extract activations from multiple layers. (2) Sparse autoencoders reconstruct activations (also usable for steering), and features are filtered to retain the most viable candidates. (3) Retained features are characterized via musical examples and labeled using generative labeling with a multimodal LM and classifier-based labeling with pre-trained models.
  • Figure 2: Examples of features discovered using the sparse autoencoders we train. Note: these examples are labeled manually. Spectrograms highlight similarities across examples within a concept.
  • Figure 3: Avg. CLAP (Wu et al., 2023) score across layers, comparing feature audio to automatic concept labels. For MGL, later layers appear to produce more interpretable features on average.
  • Figure 4: Distribution of max. CLAP scores across all SAEs. Pooling both Gemini- and Essentia-produced labels, we score them using CLAP, showing the trade-off between confidence and coverage at different potential filter levels.
  • Figure 5: Examples of steered features. Note: these examples were labeled automatically. (Baseline) Generation without steering for "Simple melody." (Feature Examples) Top max. activating examples for the steering feature. (Steered) Generation steering with the same prompt and seed as the baseline, and maximum strength empirically calculated from the maximum activations. The steering shows close alignment with the feature examples, as seen in the spectrograms.
  • ...and 4 more figures
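
The steering described in the Figure 5 caption can be sketched as adding a scaled concept direction to the residual stream at each time step, with the strength capped by the feature's largest observed activation. This is a simplified sketch under stated assumptions: the function names `steer` and `max_strength`, the plain-list representation, and the choice to add the same vector at every step are hypothetical and not taken from the paper.

```python
def max_strength(feature_activations):
    """Assumed heuristic: use the largest observed activation as the strength cap."""
    return max(feature_activations)


def steer(residual, direction, strength):
    """Add `strength * direction` to every time step of a residual stream.

    residual:  list of time steps, each a list of d_model floats
    direction: concept vector of d_model floats (e.g. one SAE decoder column)
    """
    return [[r + strength * d for r, d in zip(step, direction)]
            for step in residual]


# Hypothetical usage with two time steps and a 2-dimensional model.
residual = [[0.0, 1.0], [2.0, 3.0]]
direction = [1.0, -1.0]
strength = max_strength([0.5, 2.0, 1.25])
steered = steer(residual, direction, strength)
```

In a real model this addition would be applied inside the forward pass at the chosen layer, with the prompt and random seed held fixed so the steered generation can be compared against the unsteered baseline.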