What Do Neurons Listen To? A Neuron-level Dissection of a General-purpose Audio Model

Takao Kawamura, Daisuke Niizumi, Nobutaka Ono

TL;DR

The paper investigates how general-purpose audio SSL models generalize to unseen tasks by inspecting internal neuron-level representations. It introduces Audio Activation Probability Entropy (AAPE) to identify class-specific neurons and analyzes their activation patterns, cross-task sharing, and causal role via ablations. Findings show SSL models develop extensive class-specific neurons with near-complete task coverage, plus shared responses for speech attributes, pitch, and acoustic similarities, and that these neurons contribute to classification. The work provides a mechanistic view of audio model generalization and offers a path toward more interpretable and robust audio foundation models.

Abstract

In this paper, we analyze the internal representations of a general-purpose audio self-supervised learning (SSL) model from a neuron-level perspective. Despite their strong empirical performance as feature extractors, the internal mechanisms underlying the robust generalization of SSL audio models remain unclear. Drawing on the framework of mechanistic interpretability, we identify and examine class-specific neurons by analyzing conditional activation patterns across diverse tasks. Our analysis reveals that SSL models foster the emergence of class-specific neurons that provide extensive coverage across novel task classes. These neurons exhibit shared responses across different semantic categories and acoustic similarities, such as speech attributes and musical pitch. We also confirm that these neurons have a functional impact on classification performance. To our knowledge, this is the first systematic neuron-level analysis of a general-purpose audio SSL model, providing new insights into its internal representation.
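The TL;DR names Audio Activation Probability Entropy (AAPE) as the tool for finding class-specific neurons, but this summary does not give its definition. The sketch below is one plausible reading of the name, not the authors' formula: for each neuron, estimate the probability that it fires given each class, normalize those probabilities across classes, and take the entropy. A neuron with low entropy fires mostly for a few classes. The function names, the firing threshold of 0.0, and the `max_entropy` and `min_prob` cutoffs are all hypothetical choices made for illustration.

```python
import math

def activation_probabilities(activations, labels, n_classes, threshold=0.0):
    """Estimate P(neuron fires | class) from per-sample activations.

    activations: list of rows, one row of neuron activations per sample.
    labels: list of integer class ids, one per sample.
    A neuron is counted as firing when its activation exceeds `threshold`
    (an assumed criterion, not taken from the paper).
    """
    n_neurons = len(activations[0])
    fired = [[0] * n_neurons for _ in range(n_classes)]
    counts = [0] * n_classes
    for row, c in zip(activations, labels):
        counts[c] += 1
        for j, a in enumerate(row):
            if a > threshold:
                fired[c][j] += 1
    return [
        [fired[c][j] / counts[c] if counts[c] else 0.0 for j in range(n_neurons)]
        for c in range(n_classes)
    ]

def activation_entropy(probs):
    """Entropy (in nats) of each neuron's class-normalized firing probabilities."""
    n_classes, n_neurons = len(probs), len(probs[0])
    entropies = []
    for j in range(n_neurons):
        total = sum(probs[c][j] for c in range(n_classes))
        if total == 0:
            # A neuron that never fires carries no class information;
            # give it infinite entropy so it is never selected.
            entropies.append(float("inf"))
            continue
        h = 0.0
        for c in range(n_classes):
            p = probs[c][j] / total
            if p > 0:
                h -= p * math.log(p)
        entropies.append(h)
    return entropies

def class_specific_neurons(probs, max_entropy, min_prob=0.5):
    """Map each class to the low-entropy neurons that fire reliably for it."""
    ent = activation_entropy(probs)
    return {
        c: [j for j, h in enumerate(ent)
            if h <= max_entropy and probs[c][j] >= min_prob]
        for c in range(len(probs))
    }

# Toy data: neuron 0 fires only for class 0, neuron 1 only for class 1,
# and neuron 2 fires for every sample.
acts = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
labels = [0, 0, 1, 1]
probs = activation_probabilities(acts, labels, n_classes=2)
selected = class_specific_neurons(probs, max_entropy=0.1)
```

In the toy data, neurons 0 and 1 have zero entropy and are assigned to classes 0 and 1 respectively, while neuron 2 has the maximum entropy for two classes and is excluded. A percentile-based entropy cutoff would be an equally reasonable alternative to the absolute `max_entropy` used here.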

Paper Structure (10 sections, 2 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Shared neurons between VC1 and CREMA-D under gender-based class definitions. SSL demonstrates clear cross-task sharing of neurons aligned with gender, whereas SL exhibits negligible sharing.
  • Figure 2: Common neuron ratios for each octave class in NSynth and Surge. Despite differences in synthesis methods and dataset characteristics, octave-specific neurons are consistently observed across both tasks.
  • Figure 3: Common neuron ratios for semantically overlapping event classes across ESC-50 and GISE-51. Compared with Fig. 1 and Fig. 2, the ratios are generally lower.
  • Figure 4: Common neuron ratios across genre classes in GTZAN. "Classical" and "jazz" share a relatively large number of neurons, exhibiting a distinct sharing pattern compared with other genres.
  • Figure 5: Common neuron ratios across language classes in VoxForge. SSL results exhibit relatively high neuron sharing within Germanic ("de," "en") and Romance ("es," "fr," "it") language families.
  • ...and 3 more figures