From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams

TL;DR

The paper investigates whether mechanistic interpretability can extract scientifically meaningful knowledge from neural networks trained on complex, high‑dimensional data. It extends MI from modular arithmetic to nuclear physics by training a fixed‑attention transformer on nuclear data and analyzing embeddings, latent space topography, and hidden activations with PCA, helix analyses, and symbolic regression. The authors find that proton/neutron embeddings organize into helices and parity structures that align with the semi‑empirical mass formula and shell‑model corrections, and that multi‑task learning enhances generalization and interpretability. This work demonstrates a proof‑of‑concept that neural networks can learn and communicate domain knowledge, offering interpretable corrections to established theories and guiding scientific discovery in data‑rich domains.
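The PCA step mentioned above can be illustrated with a minimal sketch. It does not use the authors' model or data: the embedding matrix is a synthetic helix lifted into a higher-dimensional space, and the function names, dimensions, and period are all hypothetical choices made for illustration. It only shows how projecting token embeddings onto their leading principal components can recover a low-dimensional structure.

```python
import numpy as np


def pca_project(embeddings, n_components=3):
    """Project rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # SVD of the centered matrix: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained_ratio = (s ** 2) / np.sum(s ** 2)
    return projected, explained_ratio[:n_components]


def synthetic_helix_embeddings(n_tokens=100, dim=64, period=10.0, seed=0):
    """Hypothetical stand-in for learned embeddings: a noisy 3D helix
    embedded in `dim` dimensions through a random linear map."""
    rng = np.random.default_rng(seed)
    n = np.arange(n_tokens)
    helix = np.stack(
        [
            np.cos(2 * np.pi * n / period),
            np.sin(2 * np.pi * n / period),
            n / n_tokens,  # slow drift along the helix axis
        ],
        axis=1,
    )
    basis = rng.normal(size=(3, dim))
    noise = 0.01 * rng.normal(size=(n_tokens, dim))
    return helix @ basis + noise


emb = synthetic_helix_embeddings()
proj, ratio = pca_project(emb, n_components=3)
print(proj.shape, ratio.sum())
```

With real embeddings, `emb` would be replaced by the model's embedding table (one row per proton or neutron number), and the first two columns of `proj` plotted with the third as color, as in the paper's Figure 1.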

Abstract

Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.

Paper Structure (39 sections, 7 equations, 24 figures)

Figures (24)

  • Figure 1: Projections of neutron number embeddings onto their first three principal components (PCs). Models were trained on nuclear data (left) or a human-derived nuclear theory (right). X-axis: 1st PC, Y-axis: 2nd PC, color: 3rd PC. Numbers indicate the neutron number ($N$) of each nucleus (see Setup in \ref{sec:beyond-arithmetic}). The helix structure encodes insights about nuclear physics discussed in subsequent sections.
  • Figure 2: (left) Principal component projection of modular addition embeddings. The circular structure mirrors human-derived approaches used to teach modular arithmetic. (right) Model output in regions of the phase space. From liu2022towards.
  • Figure 3: Binding energy per nucleon as given by the semi-empirical mass formula (SEMF, left) and as observed in measurements (right).
  • Figure 4: Binding energy prediction error as a function of number of PCs used at different layers.
  • Figure 5: PC projections of $Z$ embeddings from a model trained on all tasks. Color hue is a monotonic function of the proton number $Z$, which makes any ordering in the embeddings easy to spot.
  • ...and 19 more figures
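
For readers unfamiliar with the semi-empirical mass formula referenced in the captions above, the following is a minimal sketch of its common textbook form (volume, surface, Coulomb, asymmetry, and pairing terms). The coefficient values are illustrative assumptions from one standard fit, not taken from the paper; published fits differ, and the paper may use a different parameterization.

```python
import math

# Illustrative coefficients in MeV (assumed; values vary between fits).
A_VOLUME = 15.75
A_SURFACE = 17.8
A_COULOMB = 0.711
A_ASYMMETRY = 23.7
A_PAIRING = 11.18


def semf_binding_energy(Z, N):
    """Textbook SEMF binding energy (MeV) for Z protons and N neutrons."""
    A = Z + N
    volume = A_VOLUME * A
    surface = A_SURFACE * A ** (2 / 3)
    coulomb = A_COULOMB * Z * (Z - 1) / A ** (1 / 3)
    asymmetry = A_ASYMMETRY * (N - Z) ** 2 / A
    # Pairing term: positive for even-even, negative for odd-odd, zero otherwise.
    if Z % 2 == 0 and N % 2 == 0:
        pairing = A_PAIRING / math.sqrt(A)
    elif Z % 2 == 1 and N % 2 == 1:
        pairing = -A_PAIRING / math.sqrt(A)
    else:
        pairing = 0.0
    return volume - surface - coulomb - asymmetry + pairing


def binding_energy_per_nucleon(Z, N):
    return semf_binding_energy(Z, N) / (Z + N)


# Example: iron-56 (Z = 26, N = 30).
fe56 = binding_energy_per_nucleon(26, 30)
print(f"Fe-56 binding energy per nucleon: {fe56:.2f} MeV")
```

Evaluating `binding_energy_per_nucleon` on a grid of (Z, N) gives a smooth surface like the SEMF panel of Figure 3; the difference from measured values is where shell-model corrections enter.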