Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Maximilian Li, Xander Davies, Max Nadeau
TL;DR
This paper introduces targeted edge ablation, a method for removing undesirable behaviors acquired during pretraining. It treats the model as a computation graph and learns a sparse mask over the graph's edges, so that the causal pathways responsible for the behavior are disabled at inference. Unlike finetuning or task arithmetic, the approach is deliberately limited in expressivity and preserves the model's structure. The experiments combine the causal-graph representation, continuous edge masks, and a toxicity-focused dataset to measure efficacy and specificity; on GPT-2, zero-ablating 12 of the 11.6K edges substantially reduces toxic generation with minimal degradation of other behaviors. Overall, the paper presents edge-level circuit breaking as a data-efficient, interpretable alternative to conventional finetuning for targeted behavioral modification in transformers.
Abstract
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
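To make the idea of learning which edges to ablate concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation. The "model" is assumed to be a single linear readout whose four weighted inputs stand in for causal edges, and the loss (squared output on bad inputs, squared deviation from the unmasked output on good inputs, plus a penalty `lam` per ablated edge) is an illustrative assumption. All names (`learn_edge_mask`, `forward`, `lam`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, mask, x):
    # Each term m * w * x_i is one "edge"; mask value 0 zero-ablates it.
    return sum(m * w * xi for m, w, xi in zip(mask, weights, x))

def learn_edge_mask(weights, bad_inputs, good_inputs,
                    lam=0.05, lr=0.5, steps=2000):
    """Learn a continuous mask in (0, 1) per edge, then binarize it.

    Assumed toy loss (not taken from the paper):
      mean over bad inputs of output^2
      + mean over good inputs of (output - unmasked output)^2
      + lam * sum(1 - mask)        # discourages ablating many edges
    """
    n = len(weights)
    logits = [3.0] * n  # start with every edge almost fully intact
    good_targets = [forward(weights, [1.0] * n, x) for x in good_inputs]
    for _ in range(steps):
        mask = [sigmoid(t) for t in logits]
        grad = [-lam] * n  # d/dm of lam * (1 - m)
        for x in bad_inputs:
            y = forward(weights, mask, x)
            for e in range(n):
                grad[e] += 2.0 * y * weights[e] * x[e] / len(bad_inputs)
        for x, target in zip(good_inputs, good_targets):
            y = forward(weights, mask, x)
            for e in range(n):
                grad[e] += 2.0 * (y - target) * weights[e] * x[e] / len(good_inputs)
        for e in range(n):
            m = mask[e]
            logits[e] -= lr * grad[e] * m * (1.0 - m)  # chain rule through sigmoid
    return [sigmoid(t) >= 0.5 for t in logits]  # True means "keep this edge"

# Toy data: edge 0 is active only on the bad inputs.
weights = [2.0, 1.0, -1.0, 0.5]
bad_inputs = [[1.0, 0.2, 0.1, 0.0], [1.0, 0.0, 0.2, 0.1]]
good_inputs = [[0.0, 1.0, 0.5, 0.2], [0.0, 0.3, 1.0, 0.8], [0.0, 0.7, 0.2, 1.0]]

kept = learn_edge_mask(weights, bad_inputs, good_inputs)
ablated = [e for e, keep in enumerate(kept) if not keep]
binary_mask = [1.0 if keep else 0.0 for keep in kept]
print("ablated edges:", ablated)
```

In this toy setup the learned mask should ablate only edge 0, which drives the output on the bad inputs, while outputs on the good inputs are unchanged because that edge is inactive there. The sparsity penalty plays the role of keeping the number of ablated edges small, analogous to ablating only 12 of 11.6K edges in the GPT-2 setting.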
