BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
Sandro Andric
TL;DR
BlockCert tackles the gap between mechanistic interpretability and formal verification by offering certified blockwise extraction of transformer mechanisms. It introduces an explicit intermediate representation (IR) for residual blocks, attaches machine-checkable certificates that quantify per-block approximation error and coverage, and provides a Lean 4 global-composition theorem to bound the overall deviation of stitched models. Empirically, BlockCert achieves high per-block coverage and small residuals across GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B, including a fully stitched TinyLlama that matches baseline perplexity on stress prompts to within $6\times10^{-5}$. The framework also supports certified local edits, showing how edits can be documented and bounded, thereby bridging mechanistic interpretability with formal reasoning about model behavior and safe modification in real-world transformers.
Abstract
Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
