SCAN: Sparse Circuit Anchor Interpretable Neuron for Lifelong Knowledge Editing

Yuhuan Liu, Haitian Zhong, Xinyuan Xia, Qiang Liu, Shu Wu, Liang Wang

Abstract

Large Language Models (LLMs) often suffer from catastrophic forgetting and collapse during sequential knowledge editing. This vulnerability stems from the prevailing dense editing paradigm, which treats models as black boxes and relies on coarse-grained parameter interventions that inevitably disrupt preserved knowledge. To address this, we propose SCAN, a sparse editing framework based on Sparse Circuit Anchored Neuron, which transforms editing into a mechanism-aware manipulation by constructing a knowledge circuit via Sparse Transcoders. Experiments on Gemma2, Qwen3, and Llama3.1 across CounterFact, ZsRE, and WikiFactDiff demonstrate that SCAN achieves superior performance, maintaining model integrity on benchmarks such as MMLU and GSM8K even after 3,000 sequential edits, whereas existing methods deteriorate progressively as edits accumulate and eventually collapse.

Paper Structure (40 sections, 5 theorems, 41 equations, 7 figures, 3 tables, 1 algorithm)


Key Result

Proposition 3.2

Let $X,Y\subset \mathbb{R}^n$ be two spaces and let $f: X \to Y$ be a mapping such that $f(0)=0$, $f$ is differentiable at every point $x_0\in X$ with Jacobian matrix $J_f(x_0)$, and $J_f$ is continuous and non-singular at $0$. Then the direction of any vector transformed by $f$ is closely aligned with the direction induced by the Jacobian transformation.
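
The alignment claim in Proposition 3.2 can be checked numerically on a toy example. The sketch below is illustrative only: the map `f`, its Jacobian `J0` at the origin, and the use of cosine similarity as the alignment measure are assumptions chosen for this example, not taken from the paper. It compares the direction of $f(x)$ with the direction of $J_f(0)\,x$ as $x$ shrinks toward $0$ along a fixed direction.

```python
import math

# Hypothetical toy map with f(0) = 0 (not from the paper).
def f(x):
    x0, x1 = x
    return (2.0 * x0 + x1 + x0 ** 2, -x0 + 3.0 * x1 + math.sin(x1) * x0)

# Jacobian of the toy map at the origin, derived by hand.
# Its determinant is 2*3 - 1*(-1) = 7, so it is non-singular.
J0 = ((2.0, 1.0), (-1.0, 3.0))

def apply_jacobian(x):
    """Return J0 @ x for a 2-vector x."""
    return tuple(row[0] * x[0] + row[1] * x[1] for row in J0)

def cosine(u, v):
    """Cosine similarity between two non-zero 2-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def alignment(direction, scale):
    """Cosine similarity between f(x) and J0 @ x at x = scale * direction."""
    x = tuple(scale * d for d in direction)
    return cosine(f(x), apply_jacobian(x))

direction = (0.6, -0.8)            # arbitrary unit vector
scales = (1.0, 0.1, 0.01, 0.001)   # move x toward the origin
alignments = [alignment(direction, s) for s in scales]

for s, a in zip(scales, alignments):
    print(f"scale={s:<6} cosine={a:.10f}")
```

For this toy map the cosine similarity rises toward 1 as the scale decreases, which matches the intuition that the Jacobian at the origin captures the direction of $f(x)$ for small inputs. Other choices of `f` or `direction` would give different rates of convergence.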

Figures (7)

  • Figure 1: Comparison of current methods and ours. Current methods (a) modify the entire dense MLP weight matrix. Our approach (b) isolates factual features, editing knowledge-relevant vectors.
  • Figure 2: Cumulative proportion of selected features across different token positions. (a) and (b) show the distributions for Gemma2-2B and Qwen3-8B on the CounterFact dataset, respectively.
  • Figure 3: Distribution of selected features across layers. Both models exhibit a characteristic dual-peak pattern, indicating functional localization in shallow and middle-to-deep layers.
  • Figure 4: Heatmap of the selected feature distribution across layers at special token positions. The dark regions indicate that the early-layer peaks in Figure 3 align with the subject tokens, while the later-layer peaks correspond to the last token position for both models.
  • Figure 5: Activation visualization of identified features on specific prompts. The left column shows Feature #13366 at Layer 19, and the right column shows Feature #410 at Layer 24. Darker colors indicate higher activation values.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Definition 3.1: Initiation of Attribution Graph
  • Proposition 3.2: Jacobian as the Optimal Direction-Preserving Linearization
  • Definition 3.3: Direct (one-step) Attribution Matrix
  • Theorem 3.4: Full-derivative expansion
  • Proposition 3.5: Closed-form Total Attribution Matrix
  • Lemma 2.1: Stability of normalization
  • proof
  • proof
  • Lemma 2.2: Convergence of Powers of $A$
  • proof
  • ...and 2 more