Compact Proofs of Model Performance via Mechanistic Interpretability

Jason Gross; Rajashree Agrawal; Thomas Kwa; Euan Ong; Chun Hei Yip; Alex Gibson; Soufiane Noubir; Lawrence Chan

Compact Proofs of Model Performance via Mechanistic Interpretability

Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

TL;DR

This work investigates whether mechanistic interpretability can yield compact, provable generalization bounds for neural networks. By studying a one-layer transformer on a Max-of-K task and reverse-engineering its weights, the authors develop 102 proof strategies that vary in computational cost and bound tightness, revealing a consistent link between deeper mechanistic understanding and more compact, tighter proofs. They introduce a formal framework for global performance proofs, including pessimal ablations and convex relaxations, and demonstrate subcubic strategies that leverage low-rank structure to scale proofs to larger input spaces. A key finding is that while more faithful interpretations tend to tighten bounds, compounding structureless noise poses a major obstacle to obtaining non-vacuous proofs at scale. The results highlight a promising yet challenging path toward formally verifiable model guarantees grounded in mechanistic insight, with future work aimed at scaling to larger architectures and more complex tasks while mitigating noise-increment in proofs.

Abstract

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-K, validating proof transferability across 151 random seeds and four values of K. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.

Compact Proofs of Model Performance via Mechanistic Interpretability

TL;DR

Abstract

Paper Structure (101 sections, 26 theorems, 94 equations, 21 figures, 5 tables, 7 algorithms)

This paper contains 101 sections, 26 theorems, 94 equations, 21 figures, 5 tables, 7 algorithms.

Introduction
Mechanistic interpretability for proofs
Generalization bounds on global performance
Proof template
Proof compactness vs. tightness of bound
Proof as pessimal ablation
Experimental setting
Model Architecture
Task
Path decomposition
Mechanistic interpretation of learned models
QK circuit
OV circuit
Direct path
Proofs of model performance
...and 86 more sections

Key Result

Theorem 1

For $\Call{brute-force}{\dvocab, \nctx, \Model}$ as defined in alg:brute-force-counting,

Figures (21)

Figure 1: We construct proofs using different degrees of mechanistic interpretation. (Left) The models we consider in this paper are one-layer attention-only transformers, and so contain three "paths": the OV circuit, the QK circuit, and the direct path. (Right) For the brute-force proof (\ref{['sec:cubic-proof']}), we treat the model as a black box and thus need to check all possible combinations of inputs. For the cubic proof (\ref{['sec:cubic-proof']}), we decompose the model into its three corresponding paths, but still check the correctness of each path via brute force. Finally, in some subcubic proofs (\ref{['sec:results:proof:sub-cubic']}), we use all parts of the mechanistic interpretation presented in \ref{['sec:experimental-setting']}. (Bottom) For each of the three categories of proof, we report the number of FLOPs used in computing the certificate (lower=better, \ref{['sec:computing-flops']}), lower bound on model accuracy (higher=better), effective dimension of the unexplained parts of the model (lower=better, \ref{['sec:computing-unexplained-dimensionality-reduction']}), and asymptotic complexity of the proof strategy as we scale the inputs and model (lower=better). Significantly more compact proofs have vacuous accuracy bounds by default. Using more mechanistic understanding allows us to recover some, but not all, of the accuracy bounds on these more compact proofs, as our understanding is not fully faithful to the model internals.
Figure 2: The models in our setting implement Max-of-$K$ by attending exponentially more to larger tokens and copying the attended-to tokens (\ref{['sec:experimental-setting:mech-interp']}).
Figure 3: For each of the proofs in \ref{['sec:provably-bounding-model-performance']}, we plot the number of FLOPs used to compute the certificate, as well as the normalized accuracy lower-bound ($\bound/\AverageScore$). The brute-force proof (\ref{['sec:results:proof:brute-force']}) computes the exact performance uses orders of magnitude more compute than other approaches. The cubic proof (\ref{['sec:results:proof:sub-cubic']}) uses a small amount of mechanistic understanding and less compute, while still retaining good accuracy lower bounds. Finally, subcubic proofs (\ref{['sec:results:proof:sub-cubic']}) require the entirety of the mechanistic interpretation of the model to attain non-vacuous bounds; this understanding allows us to further reduces compute costs, but we still achieve worse bounds. See \ref{['sec:trick-comparison:grouped-by-complexity']} for a detailed description of the various proof strategies.
Figure 4: We plot the normalized accuracy bound versus the ratio of first and second singular values of $\EPQKE$, for various types of subcubic proofs that depend on a rank-1 approximation $\EPQKE$. For each class of proof, the closer $\EPQKE$ is to rank-1, the tighter the accuracy bound. This suggests that more faithful interpretations lead to tighter bounds even holding proof length fixed. Note that the "svd" proof strategy has the clearest upward trend ($\bound/\AverageScore = 0.00(\sigma_1/\sigma_2) + 0.44$, $R^2 = 0.41$). See \ref{['sec:trick-comparison:grouped-by-attention']} for a detailed description of the various proof strategies.
Figure 5: We plot, for each proof, the approximate number of flops required to evaluate the proof, versus the unexplained dimensionality (\ref{['sec:short-proofs-understanding']}). More mechanistic understanding leaves fewer dimensions unexplained. We observe that more compact proofs seem to leave fewer unexplained dimensions, which is indicative of the relationship of mechanistic understanding and compact proofs. See \ref{['sec:trick-comparison:grouped-by-complexity']} for a detailed description of the various proof strategies.
...and 16 more figures

Theorems & Definitions (68)

Theorem 1
proof
Theorem 2: Warmup: Extremizing weighted averages
proof
Definition 1: Common theorem parameters
Definition 2: of a sequence via sorted tokens and a position permutation
Definition 3: sequence score
Definition 4: swap permutation
Lemma 1: Characterization of swapping tokens
proof
...and 58 more

Compact Proofs of Model Performance via Mechanistic Interpretability

TL;DR

Abstract

Compact Proofs of Model Performance via Mechanistic Interpretability

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (21)

Theorems & Definitions (68)