CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

Swapnil Parekh

TL;DR

This work constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score, and extracts a strict-consensus circuit consisting only of edges that appear in all views, which produces a threshold-robust "core" circuit.

Abstract

Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
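
The stability score and consensus extraction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each pruned view is simply a set of directed edges, and the function names (`edge_stability`, `consensus`) and the toy edges are hypothetical, chosen only for illustration.

```python
from collections import Counter


def edge_stability(views):
    """Return s(e): the fraction of pruned views that retain each edge."""
    counts = Counter(edge for view in views for edge in set(view))
    return {edge: n / len(views) for edge, n in counts.items()}


def consensus(views, tau=1.0):
    """Return edges with stability >= tau; tau=1.0 is strict consensus."""
    stability = edge_stability(views)
    return {edge for edge, s in stability.items() if s >= tau}


# Three hypothetical pruned views of one raw attribution graph
# (toy data, not taken from the paper).
views = [
    {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("a", "b"), ("b", "c")},
]

stability = edge_stability(views)
core = consensus(views, tau=1.0)   # edges present in every view
union = set().union(*views)        # edges present in at least one view
contingent = union - core          # edges that depend on the configuration
```

In this toy example the core keeps only the two edges shared by all three views, while the remaining edges of the union are surfaced as contingent alternatives with stability below 1.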

Paper Structure (22 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Bagged attribution pipeline. One raw attribution graph is pruned under $B$ configurations to yield multiple views; edges receive stability scores $s(e)$. Strict consensus $C_{\tau=1}$ keeps only edges present in all views (solid); dashed edges indicate contingent alternatives. Boosting and causal validation (activation patching) are in Section \ref{sec:boosting} and Section \ref{sec:res-uncertainty} (Fig. \ref{fig:activation-patching}).
  • Figure 2: Activation patching recovery ($n=20$ prompts). Per-prompt difference in recovery (patched $-$ corrupted logit). Left: consensus $-$ random control. Right: consensus $-$ matched baseline. Consensus outperforms both baselines (on 18/20 and 17/20 prompts, respectively). Mean paired difference (95% CI): 12.6 [6.8, 17.6] vs. control; 17.8 [10.8, 23.1] vs. matched. The horizontal line marks oracle recovery (ceiling).
  • Figure 3: Stability with $B=25$ configs. Left: Edge stability histogram. Right: Consensus size $|C_\tau|$ and $\mathrm{IR}(C_\tau)$ vs. $\tau$ (elbow curve); shaded bands at strict consensus ($\tau=1$) show bootstrap 95% CI over configs.
  • Figure 4: Config-drop (leave-one-config-out): consensus edge count and influence retained when each pruning config is omitted. Dropping the loosest config changes consensus substantially; dropping a stricter config does not.
  • Figure 5: Stability vs. influence. X-axis: edge stability $s(e)$. Left Y-axis: count of edges (histogram). Right Y-axis: mean influence per edge (line with error bars). High-stability edges are rare but carry the most influence, supporting threshold-robust consensus as the primary circuit.
  • ...and 3 more figures