The Anatomy of an Edit: Mechanism-Guided Activation Steering for Knowledge Editing

Yuan Cao, Mingyang Wang, Hinrich Schütze

Abstract

Large language models (LLMs) are increasingly used as knowledge bases, but keeping them up to date requires targeted knowledge editing (KE). However, it remains unclear how edits are implemented inside the model once applied. In this work, we take a mechanistic view of KE using neuron-level knowledge attribution (NLKA). Unlike prior work that focuses on pre-edit causal tracing and localization, we use post-edit attribution -- contrasting successful and failed edits -- to isolate the computations that shift when an edit succeeds. Across representative KE methods, we find a consistent pattern: mid-to-late attention predominantly promotes the new target, while attention and FFN modules cooperate to suppress the original fact. Motivated by these findings, we propose MEGA, a MEchanism-Guided Activation steering method that performs attention-residual interventions in attribution-aligned regions without modifying model weights. On CounterFact and Popular, MEGA achieves strong editing performance across KE metrics on GPT2-XL and LLaMA2-7B. Overall, our results elevate post-edit attribution from analysis to engineering signal: by pinpointing where and how edits take hold, it powers MEGA to deliver reliable, architecture-agnostic knowledge edits.

Paper Structure (56 sections, 4 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: Overview of MEGA. (Left): Post-edit NLKA identifies mid-to-late attention layers with high leverage on the target vs. original token. (Middle): MEGA uses this signal to select an edit zone. (Right): MEGA applies PCA-stabilized attention–residual steering at inference time, promoting the edited fact without changing weights.
  • Figure 2: MEMIT success vs. failure. Mean contribution differences across multiple cases; positive values promote the new target, negative values suppress the original.
  • Figure 3: IKE success vs. failure on CounterFact (GPT2-XL).
  • Figure 4: ROME success vs. failure on CounterFact (GPT2-XL).
  • Figure 5: Fine-tuning success vs. failure on CounterFact (GPT2-XL).
  • ...and 12 more figures