$C$-$\Delta\Theta$: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu

TL;DR

C-Δθ uses circuit-restricted weight editing to move selective refusal from inference time to a one-time offline checkpoint update. It localizes refusal computation to a sparse circuit via EAP-IG and applies a constrained weight update Δθ_C to circuit parameters only, yielding a drop-in edited checkpoint θ' with no runtime hooks. Across six models and five harm categories, the approach achieves strong harmful-refusal gains with minimal over-refusal and negligible utility degradation, outperforming or matching inference-time baselines in many settings. The deployment-friendly design reduces serving overhead, enables auditable safety controls, and generalizes to multi-category targeting and out-of-distribution prompts, offering a practical path to scalable, mechanistic safety in LLMs.

Abstract

Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
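The idea of an update "supported only on that circuit" can be illustrated with a minimal sketch. It assumes the edited weights take the form θ' = θ + α · (mask ⊙ Δθ), where the binary mask marks circuit parameters; this form, the scale `alpha`, the function name `apply_circuit_update`, and the toy parameter names are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def apply_circuit_update(theta, delta, mask, alpha=1.0):
    """Return edited weights theta' = theta + alpha * (mask * delta).

    theta, delta: dicts mapping parameter name -> array of equal shape.
    mask: dict mapping parameter name -> 0/1 array; tensors missing from
    the dict are treated as lying entirely outside the circuit.
    (This layout is a hypothetical choice for the sketch.)
    """
    edited = {}
    for name, w in theta.items():
        m = mask.get(name)
        if m is None:
            edited[name] = w.copy()  # untouched: not part of the circuit
        else:
            edited[name] = w + alpha * m * delta[name]
    return edited

# Toy checkpoint with two weight matrices (128 parameters in total).
rng = np.random.default_rng(0)
theta = {"layer0.W": rng.normal(size=(8, 8)),
         "layer1.W": rng.normal(size=(8, 8))}
delta = {k: rng.normal(size=v.shape) for k, v in theta.items()}

# Only 3 of the 128 parameters belong to the (made-up) circuit.
mask = {"layer0.W": np.zeros((8, 8))}
mask["layer0.W"][0, :3] = 1.0

theta_edited = apply_circuit_update(theta, delta, mask, alpha=0.5)
changed = sum(int(np.count_nonzero(theta_edited[k] != theta[k])) for k in theta)
```

Because the result is an ordinary dictionary of weights, it could be saved and served like any other checkpoint, which is the sense in which the edit needs no inference-time hooks.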

Paper Structure

This paper contains 56 sections, 8 equations, 2 figures, 13 tables, and 2 algorithms.

Figures (2)

  • Figure 1: Targeted Behavioral Steering via Circuit-Restricted Weight Editing. Comparison of model responses to a "Legal Opinion" safety prompt. The Base Model (left) complies with the unsafe request, while the Steered Model (right), optimized using C-$\Delta\Theta$, successfully refuses. This demonstrates effective harmful behavior removal through weight updates alone, without inference-time interventions.
  • Figure 2: $C$-$\Delta\Theta$: Circuit-Restricted Weight Arithmetic. (1) Construct contrastive prompt pairs with matched topic/style but different desired policy outcomes (refuse vs. comply). (2) Localize refusal-causal computation using EAP-IG and extract a sparse circuit mask. (3) Perform an offline, circuit-restricted weight update to produce a drop-in edited checkpoint that requires no inference-time hooks.
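Step (2) of Figure 2, extracting a sparse circuit mask, can be sketched as a simple top-fraction selection over attribution scores. The sketch assumes one attribution score per parameter is already available; in the paper these would come from EAP-IG, and how its edge-level attributions map onto parameters is not specified in this section. The function name `top_fraction_mask`, the 5% default, and the synthetic scores are illustrative assumptions.

```python
import numpy as np

def top_fraction_mask(scores, fraction=0.05):
    """Boolean mask keeping the `fraction` of entries with largest |score|.

    `scores` stands in for attribution scores (hypothetical input here);
    the returned mask has the same shape as `scores`.
    """
    flat = np.abs(scores).ravel()
    k = max(1, int(round(fraction * flat.size)))
    # argpartition places the k largest magnitudes in the last k slots
    top_idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(scores.shape)

# Synthetic scores from -50 to 49; the largest magnitudes sit at both ends.
scores = np.arange(100, dtype=float).reshape(10, 10) - 50.0
mask = top_fraction_mask(scores, fraction=0.05)
kept = int(mask.sum())
```

Selecting by magnitude keeps both strongly positive and strongly negative attributions. A mask like this is what would restrict the offline weight update in step (3) to circuit parameters.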