Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Maximilian Li, Xander Davies, Max Nadeau

TL;DR

The paper introduces targeted edge ablation, a method for removing undesirable behaviors acquired during pretraining: the model is treated as a computation graph, and a sparse mask over its edges is learned so that the causal pathways responsible for the behavior are disabled at inference time. The approach emphasizes limited expressivity and structure preservation, in contrast to finetuning or task arithmetic. The experiments combine a causal-graph representation, continuous edge masks, and a toxicity-focused dataset to measure efficacy and specificity; in GPT-2, zero-ablating just 12 edges substantially reduces toxic generation with minimal degradation of other behaviors. Overall, the paper presents edge-level circuit breaking as a data-efficient, interpretable alternative to conventional finetuning for targeted behavioral modification in transformers.

Abstract

Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
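
The idea of learning which pathways to ablate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four-edge "graph", the per-edge contribution numbers, the toy objective (bad-behavior score, plus a penalty for drifting on good behavior, plus a penalty per ablated edge), and the names `learn_mask`, `BAD`, and `GOOD`. It is not the paper's loss or code; it only shows how a continuous keep-mask with a sparsity penalty can single out a small set of edges.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical per-edge contributions (made-up numbers, not from the paper).
BAD = [2.0, 0.1, 0.0, 0.05]   # how much each edge feeds the unwanted behavior
GOOD = [0.1, 1.0, 1.5, 0.8]   # how much each edge feeds behavior we want to keep


def learn_mask(bad, good, lam=0.5, alpha=1.0, lr=0.5, steps=500):
    """Gradient descent on a toy objective over a continuous keep-mask.

    keep_e = sigmoid(theta_e) lies in (0, 1); 1 means the edge is kept.
    loss = sum_e keep_e * bad_e            (unwanted behavior that survives)
         + alpha * drift**2                (change in wanted behavior)
         + lam * sum_e (1 - keep_e)        (penalty on ablating edges)
    """
    theta = [2.0] * len(bad)              # start with every edge mostly kept
    good_full = sum(good)                 # wanted-behavior score, unablated
    for _ in range(steps):
        keep = [sigmoid(t) for t in theta]
        drift = sum(k * g for k, g in zip(keep, good)) - good_full
        for e in range(len(theta)):
            # d(loss)/d(keep_e), then chain rule through the sigmoid.
            grad_keep = bad[e] + 2.0 * alpha * drift * good[e] - lam
            theta[e] -= lr * grad_keep * keep[e] * (1.0 - keep[e])
    return [sigmoid(t) for t in theta]


keep = learn_mask(BAD, GOOD)
# Binarize the learned mask: edges with keep < 0.5 are ablated at inference.
ablated = [e for e, k in enumerate(keep) if k < 0.5]
print(ablated)
```

In this toy setting the ablation penalty `lam` plays the role of the regularizer: an edge is only worth ablating when its contribution to the unwanted behavior outweighs both the penalty and the damage to the wanted behavior.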

Paper Structure (30 sections, 4 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: In targeted ablation, we (1) rewrite our model as a computation graph of a desired granularity, (2) learn a binary mask over edges while regularizing to penalize ablations, and (3) ablate edges at inference time to avoid the target bad behavior.
  • Figure 2: Ablating GPT-2 Small to remove toxicity. Left: Grey nodes are attention heads, and purple nodes are MLPs. Computation proceeds upwards, with horizontal alignment corresponding to layers. The computational graph has 11,611 edges; red edges are the 12 ablations learned to remove toxicity. Right: Examples of improved non-toxic generation.
  • Figure 3: We can subdivide an attention head into its own computational graph.
  • Figure 4: The learned mask for MNIST classification over the course of training. Note that versions of this mask in the middle of training are allowed to partially ablate each edge, so "Edges Ablated" is calculated by summing the coefficients assigned to the ablation value. The "train" points are those that the MLP was trained on, and the "test" points are those it was not. The "bad behaviors" line indicates its accuracy on the 30 exemplar digits.
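
The "Edges Ablated" quantity described in the Figure 4 caption, and the inference-time ablation step in Figure 1, can be sketched as follows. This is a hedged illustration: the linear interpolation between an edge's original value and its ablation value, the 0.5 threshold, and all function names are assumptions, not details confirmed by the captions.

```python
def soft_edges_ablated(ablation_coeffs):
    """Mid-training count: sum of the coefficients placed on the ablation value."""
    return sum(ablation_coeffs)


def hard_edges_ablated(ablation_coeffs, threshold=0.5):
    """Count after binarizing the mask (the threshold is an assumed choice)."""
    return sum(1 for c in ablation_coeffs if c > threshold)


def node_input(upstream_outputs, ablation_coeffs, ablation_value=0.0):
    """Input to one node: each incoming edge carries a mix of the upstream
    output and the ablation value, weighted by that edge's coefficient.
    With ablation_value=0.0 and a coefficient of 1, this is zero ablation."""
    return sum(
        (1.0 - c) * out + c * ablation_value
        for out, c in zip(upstream_outputs, ablation_coeffs)
    )


# Three incoming edges: untouched, fully ablated, and half ablated.
coeffs = [0.0, 1.0, 0.5]
outputs = [1.0, 2.0, 3.0]
print(soft_edges_ablated(coeffs))
print(hard_edges_ablated(coeffs))
print(node_input(outputs, coeffs))
```

Once every coefficient is exactly 0 or 1, the soft and hard counts agree, which is the regime shown in Figure 2, where 12 of 11,611 edges are ablated.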

Theorems & Definitions (1)

  • Definition 3.1: Behavior Removal