The Loss Kernel: A Geometric Probe for Deep Learning Interpretability
Maxwell Adam, Zach Furman, Jesse Hoogland
TL;DR
This work introduces the loss kernel, a covariance-based measure of functional similarity between inputs, computed under a localized probe distribution over low-loss parameters around a trained neural network. Grounded in singular learning theory, the kernel captures how the losses on pairs of inputs co-vary under joint perturbations of the weights near the minimum, which makes it possible to organize and visualize a whole dataset as the model sees it. The authors validate the method on a synthetic multitask problem, where the kernel cleanly separates independent subtasks, and apply it to Inception-v1 on ImageNet, revealing hierarchical structure that aligns with the WordNet taxonomy. The result is a practical, scalable tool for interpretability and data attribution, with potential to guide mechanistic investigations and developmental analyses of model learning.
Abstract
We introduce the loss kernel, an interpretability method for measuring similarity between data points according to a trained neural network. The kernel is the covariance matrix of per-sample losses computed under a distribution of low-loss-preserving parameter perturbations. We first validate our method on a synthetic multitask problem, showing it separates inputs by task as predicted by theory. We then apply this kernel to Inception-v1 to visualize the structure of ImageNet, and we show that the kernel's structure aligns with the WordNet semantic hierarchy. This establishes the loss kernel as a practical tool for interpretability and data attribution.
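To make the definition concrete, the following is a minimal sketch of the computation, not the authors' implementation. It assumes an isotropic Gaussian around the trained weights as a stand-in for the low-loss perturbation distribution (the abstract does not specify the sampler), and it uses a hypothetical two-task toy model in which each input's loss depends on only one of two weights. The names `loss_kernel`, `per_sample_loss`, `scale`, and `n_draws` are illustrative.

```python
import numpy as np

def loss_kernel(per_sample_loss, w_star, n_draws=2000, scale=0.05, seed=0):
    """Estimate the covariance of per-sample losses under weight perturbations.

    ASSUMPTION: perturbations are isotropic Gaussian noise around w_star.
    This is only a stand-in for a distribution concentrated on low-loss weights.
    """
    rng = np.random.default_rng(seed)
    # Draw perturbed weight vectors around the trained solution.
    draws = w_star + scale * rng.standard_normal((n_draws, w_star.size))
    # Row k holds the loss of every input under the k-th perturbed weights.
    losses = np.stack([per_sample_loss(w) for w in draws])
    # Covariance across draws, with one variable per input.
    return np.cov(losses, rowvar=False)

# Hypothetical toy multitask problem: inputs 0-1 use weight 0, inputs 2-3 use weight 1.
x = np.array([1.0, 2.0, 0.5, 1.5])
task = np.array([0, 0, 1, 1])
w_star = np.array([0.7, -1.3])
y = x * w_star[task]  # targets fit exactly at w_star

def per_sample_loss(w):
    # Squared error of each input under weights w.
    return (x * w[task] - y) ** 2

K = loss_kernel(per_sample_loss, w_star)

# Normalize to correlations so the task blocks are easy to read.
d = np.sqrt(np.diag(K))
corr = K / np.outer(d, d)
print(np.round(corr, 2))
```

In this toy setting, inputs that share a task have losses that rise and fall together, so their correlation is close to 1, while inputs from different tasks are nearly uncorrelated. This mirrors, in miniature, the task separation the abstract describes for the synthetic multitask problem.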
