Spectral Identifiability for Interpretable Probe Geometry

William Hao-Cheng Huang

TL;DR

The paper tackles the instability of linear probes used to interpret neural representations by introducing the Spectral Identifiability Principle (SIP), a verifiable pre-deployment condition rooted in spectral perturbation theory. SIP links the Fisher estimation error $\Delta$ to the discriminative subspace gap $\operatorname{gap}(\Gamma)$, yielding a finite-sample criterion for subspace stability and misclassification risk that can be checked directly from data. The authors deliver a rigorous theory showing subspace concentration and risk bounds under a margin condition, and validate the predictions with controlled synthetic experiments across Gaussian and heavy-tailed regimes, including a heavy-tail clipping analysis that reveals an optimal operating point. The work provides a practical, interpretable diagnostic to anticipate probe instability, potentially informing monitoring and robustness practices in large-scale models and downstream interpretability tasks.

Abstract

Linear probes are widely used to interpret and evaluate neural representations, yet their reliability remains unclear, as probes may appear accurate in some regimes but collapse unpredictably in others. We uncover a spectral mechanism behind this phenomenon and formalize it as the Spectral Identifiability Principle (SIP), a verifiable Fisher-inspired condition for probe stability. When the eigengap separating task-relevant directions is larger than the Fisher estimation error, the estimated subspace concentrates and accuracy remains consistent, whereas closing this gap induces instability in a phase-transition manner. Our analysis connects eigengap geometry, sample size, and misclassification risk through finite-sample reasoning, providing an interpretable diagnostic rather than a loose generalization bound. Controlled synthetic studies, where Fisher quantities are computed exactly, confirm these predictions and show how spectral inspection can anticipate unreliable probes before they distort downstream evaluation.
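
The comparison the abstract describes, eigengap versus estimation error, can be illustrated with a minimal NumPy sketch. It is not the authors' procedure. It assumes a synthetic setting in which both the population matrix $\Gamma$ and its estimate $\widehat{\Gamma}$ are available, takes $\Delta$ to be the operator norm of $\widehat{\Gamma} - \Gamma$, and uses the bare test $\Delta < \operatorname{gap}(\Gamma)$ with no constants. The function names (`top_k_eigengap`, `sin_theta_distance`, `sip_check`) are hypothetical.

```python
import numpy as np


def top_k_eigengap(gamma, k):
    """Gap between the k-th and (k+1)-th largest eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(gamma)[::-1]  # eigvalsh is ascending; reverse it
    return float(eigvals[k - 1] - eigvals[k])


def top_k_subspace(gamma, k):
    """Orthonormal basis of the top-k eigenspace of a symmetric matrix."""
    _, eigvecs = np.linalg.eigh(gamma)  # columns ordered by ascending eigenvalue
    return eigvecs[:, -k:]


def sin_theta_distance(u, u_hat):
    """Frobenius norm of the sines of the principal angles between two subspaces."""
    cosines = np.linalg.svd(u.T @ u_hat, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off above 1
    return float(np.sqrt(np.sum(1.0 - cosines**2)))


def sip_check(gamma, gamma_hat, k):
    """Compare estimation error with the eigengap (assumed form of the check)."""
    gap = top_k_eigengap(gamma, k)
    # Assumption: Delta is measured as the operator norm of the estimation error.
    delta = float(np.linalg.norm(gamma_hat - gamma, ord=2))
    return {
        "gap": gap,
        "delta": delta,
        "stable": bool(delta < gap),  # bare comparison; constants omitted
        "sin_theta": sin_theta_distance(
            top_k_subspace(gamma, k), top_k_subspace(gamma_hat, k)
        ),
    }


if __name__ == "__main__":
    gamma = np.diag([5.0, 4.0, 1.0, 1.0])   # clear gap after the top 2 directions
    noise = 0.01 * np.ones((4, 4))          # small symmetric perturbation
    print(sip_check(gamma, gamma + noise, k=2))
```

When the gap is large relative to `delta`, the reported `sin_theta` stays small. Shrinking the gap toward `delta`, for example by setting the second eigenvalue close to the third, flips `stable` to `False` and allows the estimated subspace to rotate substantially.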

Paper Structure

This paper contains 63 sections, 3 theorems, 39 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.1

Assume (R), (S), (V), and (M). Let $U$ denote the top-$k$ eigenspace of $\Gamma$, and $\widehat{U}$ its empirical counterpart obtained from $\widehat{\Gamma}$. There exist constants $c,C>0$ (depending on distributional moments and margin parameters) such that, with probability at least $1 - d^{-c}$ Hence, SIP provides a verifiable sufficient condition for probe stability:

Figures (3)

  • Figure 1: Geometry. Subspace stability is governed by the spectral threshold in (S).
  • Figure 2: Variability. Clipping has negligible effect in Gaussian but stabilizes in heavy-tailed regimes.
  • Figure 3: Probability and sample complexity. $\kappa$ and concentration jointly govern stability.

Theorems & Definitions (5)

  • Theorem 4.1: Sufficient condition for probe stability
  • Theorem A.1: Subspace concentration and misclassification: assumptions and conclusions
  • Proof: Proof of Theorem A.1
  • Lemma A.2: coordinate perturbation
  • Proof