Table of Contents
Fetching ...

Automated Circuit Interpretation via Probe Prompting

Giuseppe Birardi

TL;DR

This work tackles mechanistic interpretability by automating circuit interpretation through probe prompting, transforming attribution graphs into compact, concept-aligned subgraphs categorized as Semantic, Relationship, and Say-X. It introduces cross-prompt activation signatures to quantify feature behavior across probe prompts, and employs transparent, rule-based decision logic to achieve interpretability-oriented compression with Completeness around 0.83 and Replacement around 0.54. Empirical results on multiple prompt families demonstrate superior behavioral coherence of concept-aligned groups over geometric baselines (2.3x token-consistency; 5.8x activation-pattern similarity) and reveal a layerwise hierarchy where early layers generalize across entity substitutions while late layers specialize for output promotion. The approach promises substantial speedups for first-pass analysis, standardizes a taxonomy for reporting circuit motifs, and provides a foundation for safety-oriented interpretability research, with public code and interactive demos to encourage community adoption.

Abstract

Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic "capitals" circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing complexity (Completeness 0.83, mean across circuits; Replacement 0.54). Compared to geometric clustering baselines, concept-aligned groups exhibit higher behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and 5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower geometric compactness. Entity-swap tests reveal a layerwise hierarchy: early-layer features transfer robustly (64% transfer rate, mean layer 6.3), while late-layer Say-X features specialize for output promotion (mean layer 16.4), supporting a backbone-and-specialization view of transformer computation. We release code (https://github.com/peppinob-ol/attribution-graph-probing), an interactive demo (https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal artifacts enabling immediate reproduction and community adoption.

Automated Circuit Interpretation via Probe Prompting

TL;DR

This work tackles mechanistic interpretability by automating circuit interpretation through probe prompting, transforming attribution graphs into compact, concept-aligned subgraphs categorized as Semantic, Relationship, and Say-X. It introduces cross-prompt activation signatures to quantify feature behavior across probe prompts, and employs transparent, rule-based decision logic to achieve interpretability-oriented compression with Completeness around 0.83 and Replacement around 0.54. Empirical results on multiple prompt families demonstrate superior behavioral coherence of concept-aligned groups over geometric baselines (2.3x token-consistency; 5.8x activation-pattern similarity) and reveal a layerwise hierarchy where early layers generalize across entity substitutions while late layers specialize for output promotion. The approach promises substantial speedups for first-pass analysis, standardizes a taxonomy for reporting circuit motifs, and provides a foundation for safety-oriented interpretability research, with public code and interactive demos to encourage community adoption.

Abstract

Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis -- a single prompt can take approximately 2 hours for an experienced circuit tracer. We present probe prompting, an automated pipeline that transforms attribution graphs into compact, interpretable subgraphs built from concept-aligned supernodes. Starting from a seed prompt and target logit, we select high-influence features, generate concept-targeted yet context-varying probes, and group features by cross-prompt activation signatures into Semantic, Relationship, and Say-X categories using transparent decision rules. Across five prompts including classic "capitals" circuits, probe-prompted subgraphs preserve high explanatory coverage while compressing complexity (Completeness 0.83, mean across circuits; Replacement 0.54). Compared to geometric clustering baselines, concept-aligned groups exhibit higher behavioral coherence: 2.3x higher peak-token consistency (0.425 vs 0.183) and 5.8x higher activation-pattern similarity (0.762 vs 0.130), despite lower geometric compactness. Entity-swap tests reveal a layerwise hierarchy: early-layer features transfer robustly (64% transfer rate, mean layer 6.3), while late-layer Say-X features specialize for output promotion (mean layer 16.4), supporting a backbone-and-specialization view of transformer computation. We release code (https://github.com/peppinob-ol/attribution-graph-probing), an interactive demo (https://huggingface.co/spaces/Peppinob/attribution-graph-probing), and minimal artifacts enabling immediate reproduction and community adoption.

Paper Structure

This paper contains 82 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Probe prompting pipeline overview. Four-stage process: (1) Attribution Graph: Start with Neuronpedia graph (600+ features), select high-influence features via cumulative threshold $\tau$. (2) Probe Prompts: LLM generates 5--10 concept-targeted prompts preserving syntactic structure while varying semantic content (e.g., "capital of Texas" $\rightarrow$ "capital of California"). (3) Behavioral Signatures: Measure feature activations across all probes, computing peak-token consistency, sparsity, and semantic vs functional peak percentage. (4) Interpretable Supernodes: Apply transparent decision rules to classify features into Semantic (content detectors), Relationship (structural binding), and Say-X (output promotion) categories. This abstraction hierarchy trades raw graph completeness (83% vs 90%) for legibility and actionable insight: researchers can quickly identify entity detectors, relational binding features, and output promoters without hours of manual activation inspection. Typical processing time: 2--5 minutes for graphs with 50--200 features.
  • Figure 2: Activation characteristics distinguish feature categories across layers.(Left) Violin plot showing activation magnitude increases with context coverage (n_ctx). Multi-context features exhibit higher mean activation. (Right) Scatter plot reveals near-zero correlation between activation strength and causal influence (r=0.02, $\rho$=-0.12), demonstrating that highly-activating features are not necessarily causally important. Color indicates context breadth; size reflects influence.
  • Figure 3: Rule-based feature typing for probe-prompted subgraphs. Decision rules used to assign each feature to one of four categories—Semantic (Dictionary), Semantic (Concept), Relationship, and Say "X"—based on cross-prompt behavior. Primary thresholds use peak consistency and number of distinct peaks for stable token detectors; median sparsity and early-layer bias for relationship-type binders; and the balance of functional-vs-semantic peaks with high functional confidence for output-promotion features, typically in later layers. Secondary conditions act as tie-breakers (e.g., early-layer preference for semantic detectors; early-binding layers for relationships; late-layer prior for Say "X"). Metric shorthand: peak consistency = share of probes with the same peak token; semantic_conf = probability mass on semantic targets; func_vs_sem = proportion of functional vs semantic activations; median_sparsity = median zero-fraction across probes; confidence_functional = classifier confidence of functional behavior. Written to be standalone per common journal guidance for captions.
  • Figure 4: Stacked-prompts activation map and Cross-Prompt Activation Signature (CPAS) for Feature 20-clt-hp:74108 (“Say capital”). Rows are generated probe prompts; columns are tokens. Shading shows feature activation. The feature fires on functional tokens—most strongly on is and the—across diverse contexts and reaches its global maximum immediately before capital (63.31). Its CPAS—a summary of behavior over probe prompts built from the peak-token histogram, activation density, sparsity, and semantic-vs-functional balance—shows high functional peak consistency with low variance across prompts. This pattern aligns with an output-promotion (“say-X”) role rather than semantic detection and matches the broader observation that attribution graphs often include output features that directly up-weight specific next tokens.
  • Figure 5: Cross-prompt transfer validation (Dallas$\rightarrow$Oakland entity swap). Heatmap shows supernode activation overlap between Dallas$\rightarrow$Austin (Texas, x-axis) and Oakland$\rightarrow$Sacramento (California, y-axis). Cell color indicates normalized activation overlap (dark red = high bilateral transfer $\sim$0.15--0.20, orange = unilateral, light = no overlap). Universal features ('is', 'capital', 'containing') show high bilateral transfer, indicating early-layer backbone computation. Entity-specific features ('Texas', 'Dallas', 'California', 'Oakland') show unilateral activation (orange blocks), confirming entity specificity. Say-X output promoters ('Say Austin', 'Say Sacramento') are target-appropriate (bottom rows), appearing only for correct circuit. This pattern supports the early-vs-late hierarchy: early layers encode transferable relational structure, late layers specialize for specific output tokens. Transfer rate (64%, 25/39 features) and 10.1-layer difference between transferred (mean 6.3) vs failed (mean 16.4) features provide quantitative evidence for this stratification.