CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

Swapnil Parekh

TL;DR

This work constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score, and extracts a strict-consensus circuit consisting only of edges that appear in all views, which produces a threshold-robust "core" circuit.

Abstract

Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
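The edge-stability and strict-consensus steps described in the abstract can be illustrated with a minimal sketch. It assumes each pruned view is available as a set of `(source, target)` edge tuples; the function names `stability_scores` and `consensus` and the toy edges are hypothetical and are not taken from the paper's code.

```python
from collections import Counter
from typing import Hashable

# Assumption: an edge is a (source, target) pair of hashable node ids.
Edge = tuple[Hashable, Hashable]


def stability_scores(views: list[set[Edge]]) -> dict[Edge, float]:
    """Return, for each edge, the fraction of pruned views that retain it."""
    if not views:
        raise ValueError("need at least one pruned view")
    counts = Counter(edge for view in views for edge in view)
    return {edge: n / len(views) for edge, n in counts.items()}


def consensus(scores: dict[Edge, float], tau: float = 1.0) -> set[Edge]:
    """Keep edges whose stability is at least tau (tau=1.0 is strict consensus)."""
    return {edge for edge, s in scores.items() if s >= tau}


# Toy example: three views of one raw graph (illustrative data only).
views = [
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("a", "b"), ("b", "c")},
    {("a", "b")},
]
scores = stability_scores(views)
core = consensus(scores)          # edges present in every view
union = set().union(*views)       # edges present in at least one view
```

In this toy case only `("a", "b")` survives strict consensus, while the union holds three edges. Lowering `tau` admits edges that appear in only some views, which correspond to the contingent alternatives mentioned above.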

Paper Structure (22 sections, 3 equations, 8 figures, 4 tables)

Introduction
Related Work
Methodology
Notation and Problem Formulation
Attribution Graphs and Pruning
Config-Bagging and Uncertainty Decomposition
Boosting: Residual Uncertainty Test
Experiment Design
Results
Threshold-Sweep and Consensus
Boosting: Core, Residual, and Full
Multi-Prompt and Single-Baseline Trade-off
Uncertainty Analyses: Rejection, Alternatives, Variance
Discussion and Limitations
Conclusion
...and 7 more sections

Figures (8)

Figure 1: Bagged attribution pipeline. One raw attribution graph is pruned under $B$ configurations to yield multiple views; edges receive stability scores $s(e)$. Strict consensus $C_{\tau=1}$ keeps only edges present in all views (solid); dashed edges indicate contingent alternatives. Boosting and causal validation (activation patching) are in Section \ref{sec:boosting} and Section \ref{sec:res-uncertainty} (Fig. \ref{fig:activation-patching}).
Figure 2: Activation patching recovery ($n=20$ prompts). Per-prompt recovery difference (patched $-$ corrupted logit). Left: consensus $-$ random control. Right: consensus $-$ matched baseline. Consensus outperforms both baselines (on 18/20 and 17/20 prompts, respectively). Mean paired difference (95% CI): 12.6 [6.8, 17.6] vs. control; 17.8 [10.8, 23.1] vs. matched. The horizontal line marks oracle recovery (ceiling).
Figure 3: Stability with $B=25$ configs. Left: Edge stability histogram. Right: Consensus size $|C_\tau|$ and $\mathrm{IR}(C_\tau)$ vs. $\tau$ (elbow curve); shaded bands at strict consensus ($\tau=1$) show bootstrap 95% CI over configs.
Figure 4: Config-drop (leave-one-config-out): consensus edge count and influence retained when each pruning config is omitted. Dropping the loosest config changes consensus substantially; dropping a stricter config does not.
Figure 5: Stability vs. influence. X-axis: edge stability $s(e)$. Left Y-axis: count of edges (histogram). Right Y-axis: mean influence per edge (line with error bars). High-stability edges are rare but carry the most influence, supporting threshold-robust consensus as the primary circuit.
...and 3 more figures
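The leave-one-config-out check in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's implementation: the name `config_drop`, the config labels, and the toy edge sets are hypothetical, and only the consensus edge count is computed because the definition of influence retained is not given in this excerpt.

```python
def config_drop(views: dict[str, set]) -> dict[str, int]:
    """Strict-consensus edge count when each pruning config is omitted in turn."""
    if len(views) < 2:
        raise ValueError("need at least two configs to drop one")
    sizes = {}
    for dropped in views:
        # Intersect the views of all remaining configs.
        rest = [edges for name, edges in views.items() if name != dropped]
        sizes[dropped] = len(set.intersection(*rest))
    return sizes


# Toy views keyed by a hypothetical config label (illustrative data only).
views = {
    "cfg_a": {("a", "b"), ("b", "c")},
    "cfg_b": {("a", "b"), ("b", "c"), ("c", "d")},
    "cfg_c": {("a", "b"), ("c", "d")},
}
full_size = len(set.intersection(*views.values()))
sizes = config_drop(views)
```

Comparing each entry of `sizes` against `full_size` shows which configs the consensus depends on. The toy data is not meant to reproduce the paper's finding about the loosest config.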
