Discovering and Steering Interpretable Concepts in Large Generative Music Models
Nikhil Singh, Manuel Cherep, Pattie Maes
TL;DR
This work investigates how large transformer-based music generators internally represent musical structure. It proposes a scalable pipeline that discovers interpretable concepts by applying sparse autoencoders (SAEs) to the models' residual streams. The pipeline labels concepts automatically using multimodal and classifier signals (including CLAP alignment), validates them through human studies, and demonstrates practical steering by injecting concept vectors during generation. The approach recovers known musical categories and also reveals emergent regularities that lack established theory, and the work analyzes how interpretability scales with layer depth and model size. The results offer a transparent, controllable framework for understanding and guiding generative music systems, with implications for theory-inspired analysis and creative collaboration.
Abstract
The fidelity with which neural networks can now generate content such as music presents a scientific opportunity: these systems appear to have learned implicit theories of such content's structure through statistical learning alone. This offers a potentially new lens on theories of human-generated media. When internal representations align with traditional constructs (e.g., chord progressions in music), they show how such categories can emerge from statistical regularities; when they diverge, they expose limits of existing frameworks and patterns we may have overlooked but that nonetheless carry explanatory power. In this paper, focusing on music generators, we introduce a method for discovering interpretable concepts that uses sparse autoencoders (SAEs) to extract features from the residual stream of a transformer model. We make this approach scalable and evaluable using automated labeling and validation pipelines. Our results reveal both familiar musical concepts and coherent but uncodified patterns lacking clear counterparts in theory or language. As an extension, we show that such concepts can be used to steer model generations. Beyond improving model transparency, our work provides an empirical tool for uncovering organizing principles that have eluded traditional methods of analysis and synthesis.
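To make the two core ideas concrete, the following is a minimal sketch of a sparse autoencoder over residual-stream activations and of steering by adding a concept direction. The abstract does not specify the SAE architecture, training objective, or steering rule, so everything here is an assumption: a common ReLU SAE with an L1 sparsity penalty, additive steering along a decoder row, random untrained weights, and hypothetical names (`sae_encode`, `sae_decode`, `sae_loss`, `steer`, `d_model`, `d_sae`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: residual-stream width and (overcomplete) SAE dictionary size.
d_model, d_sae = 16, 64

# Randomly initialised parameters; a trained SAE would learn these from activations.
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
b_dec = np.zeros(d_model)


def sae_encode(x):
    """Map activations (n, d_model) to non-negative feature activations (n, d_sae)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f):
    """Reconstruct activations (n, d_model) from feature activations (n, d_sae)."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (an assumed objective)."""
    f = sae_encode(x)
    recon = np.mean(np.sum((x - sae_decode(f)) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity


def steer(x, feature_idx, strength):
    """Add a scaled decoder direction (the 'concept vector') to every position."""
    return x + strength * W_dec[feature_idx]


# Stand-in for residual-stream activations at one layer: 8 positions x d_model.
resid = rng.normal(size=(8, d_model))
features = sae_encode(resid)
loss = sae_loss(resid)
steered = steer(resid, feature_idx=3, strength=4.0)
```

In a real generator, the steering step would be applied to the activations of the chosen layer during the forward pass rather than to a standalone array; which layer, which feature, and what strength to use are choices this sketch does not attempt to answer.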
