The Loss Kernel: A Geometric Probe for Deep Learning Interpretability

Maxwell Adam, Zach Furman, Jesse Hoogland

TL;DR

This work introduces the loss kernel, a covariance-based measure of functional similarity between inputs derived from a localized, low-loss probe distribution around a trained neural network. Grounded in singular learning theory, the kernel captures how pairs of inputs respond to joint perturbations in the near-minimal weight space, enabling global structuring and visualization of data as perceived by the model. The authors validate the method on a synthetic multitask problem, where the kernel cleanly separates independent subtasks, and apply it to Inception-v1 on ImageNet, revealing hierarchical structure that aligns with the WordNet taxonomy. This provides a practical, scalable tool for interpretability and data attribution, with potential to guide mechanistic investigations and developmental analyses of model learning.

Abstract

We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel's structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.
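
As a rough illustration of the definition above, the following sketch estimates the kernel from a table of per-sample losses. It assumes the losses have already been evaluated at a set of sampled weights; how those weights are drawn is not covered here. The function names `loss_kernel` and `loss_correlation`, the `losses[t][i]` layout, the unbiased `num_draws - 1` normalization, and the toy numbers are all hypothetical choices for illustration, not taken from the paper. Plain Python is used only to keep the sketch self-contained.

```python
from math import sqrt


def loss_kernel(losses):
    """Empirical covariance of per-sample losses.

    losses[t][i] is the loss of input i at the t-th sampled weight
    (an assumed layout). Returns an n x n list of lists.
    """
    num_draws = len(losses)
    num_inputs = len(losses[0])
    # Mean loss of each input across the sampled weights.
    means = [sum(row[i] for row in losses) / num_draws for i in range(num_inputs)]
    centered = [[row[i] - means[i] for i in range(num_inputs)] for row in losses]
    # Unbiased covariance estimate (assumed normalization).
    return [
        [
            sum(c[i] * c[j] for c in centered) / (num_draws - 1)
            for j in range(num_inputs)
        ]
        for i in range(num_inputs)
    ]


def loss_correlation(kernel):
    """Normalize a covariance kernel K into a correlation kernel R."""
    n = len(kernel)
    std = [sqrt(kernel[i][i]) for i in range(n)]
    return [[kernel[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]


# Toy data: 3 sampled weights (rows) x 3 inputs (columns).
# Inputs 0 and 1 rise together; input 2 moves in the opposite direction.
toy_losses = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 2.0],
    [3.0, 6.0, 1.0],
]

K = loss_kernel(toy_losses)
R = loss_correlation(K)
```

In this toy case, inputs 0 and 1 respond identically to the sampled weights up to scale, so their entry in `R` is at the maximum, while input 2 is anti-correlated with both.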

Paper Structure

This paper contains 63 sections, 2 theorems, 41 equations, 11 figures, 1 table.

Key Result

Proposition 1

The Gibbs expectation $\mathbb{E}_{\beta}[f({\bm{w}})]$ is, up to a known factor, the Laplace transform of the low-loss integral $g(\epsilon)$, where $\mathcal{L}\{\cdot\}(\beta)$ denotes the Laplace transform with respect to $\epsilon$.

Figures (11)

  • Figure 1: Geometry of the loss kernel for Inception-v1 on ImageNet. (A) UMAP of pairwise distances induced by the normalized loss kernel $R({\mathbf{z}},{\mathbf{z}}')=\mathrm{Corr}_{{\bm{w}}\sim p({\bm{w}}\mid {\mathcal{D}})}[\ell({\mathbf{z}};{\bm{w}}),\,\ell({\mathbf{z}}';{\bm{w}})]$ for Inception-v1 on ImageNet-1k; each point is one image, colored continuously by position in the ImageNet hierarchy, so similar colors indicate semantically similar inputs. Insets 1--9: example neighborhoods with thumbnails showing coherent regions for dogs (1), primates (2), birds (3), diapsids (4), crustaceans (5), insects (6), produce (7), musical instruments (8), and vehicles/cars (9). Bottom right: orbit views of the same 3-D embedding. (B) The full correlation kernel matrix (10k$\times$10k), shown next to the ground-truth distance matrix derived from the ImageNet hierarchy; both exhibit similar block structure.
  • Figure 2: The loss kernel. The loss kernel $K({\mathbf{z}},{\mathbf{z}}')$ is the covariance of per-sample losses $\ell({\mathbf{z}},{\bm{w}})$ for two inputs ${\mathbf{z}}$ and ${\mathbf{z}}'$, computed over a probe distribution of model weights ${\bm{w}}$ (gray points) sampled near a trained solution ${{\bm{w}}^{\ast}}$. These two losses respond differently to different weights (top left, bottom right), reflecting which parts of the model are important for those inputs. A positive correlation in these losses (scatter plot, top right) signifies that the two inputs share sensitivity to the same weight perturbations, which we interpret as evidence that the model is treating the inputs ${\mathbf{z}}$ and ${\mathbf{z}}'$ similarly.
  • Figure 3: Geometry of the loss kernel for a multitask modular-arithmetic model ($p=97$). (A) UMAP of pairwise distances derived from the loss kernel ($d({\mathbf{z}},{\mathbf{z}}')=1-R({\mathbf{z}},{\mathbf{z}}')$). Two well-separated clusters correspond to modular addition (blue) and modular division (orange). A small satellite cluster corresponds to the trivial modular division case $a=0$, for which $0/b \equiv 0 \pmod{97}$. (B) Distribution of projections onto the first principal component of the normalized per-sample expected loss vectors, $\mathbb E[\ell({\mathbf{z}}_i;{\bm{w}})]-\ell({\mathbf{z}}_i; {{\bm{w}}^{\ast}})$. A single axis suffices to separate tasks (ROC--AUC $=0.931$). (C) Same UMAP as in (A), colored by the value of input $b$. (D) Log-scaled covariance distributions for Addition vs. Addition, Division vs. Division, and Addition vs. Division pairs. Within-task covariances are heavy-tailed and skewed, whereas cross-task covariances are narrowly concentrated and approximately normal.
  • Figure 4: Top-correlated examples under the loss kernel reveal interpretable patterns. For each reference image (leftmost column), we show the top five most-correlated inputs under the loss correlation kernel $R$. We observe clustering by texture (e.g., fluffy fur coat and fluffy animals), shape (e.g., circular objects and line angle), color and category (e.g., people playing sports, electronics on a white background, dark vs. light brown dogs), and spatial layout (e.g., cluttered rooms). Additional visualizations are provided in the appendix, and further computed correlation results are available at https://github.com/singfluence-anon/sf_imagenet_corrs
  • Figure 5: Taxonomic lift vs. hierarchy depth. Lines depict the weighted probability (lift) that the nearest neighbors of an input with a label $d$ nodes deep in the WordNet hierarchy will share a parent node at depth $d'$. The $x$-axis is the WordNet tree distance (edges) from the root to the shared ancestor. We report lift as the ratio of this probability to the dataset base rate at that depth. See the appendix for details.
  • ...and 6 more figures
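
The captions above use the normalized kernel $R$ in two ways: as a distance $d = 1 - R$ that is passed to UMAP (Figure 3), and as a ranking for retrieving the most-correlated inputs for a reference image (Figure 4). A minimal sketch of both steps follows, assuming a correlation matrix has already been computed. The helper names `correlation_distance` and `top_correlated` and the toy matrix are hypothetical, and the UMAP embedding step itself is omitted.

```python
def correlation_distance(R):
    """Pairwise distances d = 1 - R from a correlation matrix R."""
    return [[1.0 - r for r in row] for row in R]


def top_correlated(R, query, k):
    """Indices of the k inputs most correlated with `query`, excluding itself."""
    others = [j for j in range(len(R)) if j != query]
    return sorted(others, key=lambda j: R[query][j], reverse=True)[:k]


# Toy correlation matrix for 4 inputs (made-up values for illustration).
toy_R = [
    [1.0, 0.9, 0.1, -0.2],
    [0.9, 1.0, 0.3, 0.0],
    [0.1, 0.3, 1.0, 0.7],
    [-0.2, 0.0, 0.7, 1.0],
]

D = correlation_distance(toy_R)
neighbors = top_correlated(toy_R, query=0, k=2)
```

Here `D` is the kind of precomputed distance matrix an embedding method could consume, and `neighbors` lists the inputs whose losses co-vary most strongly with input 0.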

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof