SafeSeek: Universal Attribution of Safety Circuits in Language Models

Miao Yu, Siyuan Fu, Moayad Aloqaily, Zhenhong Zhou, Safa Otoum, Xing Fan, Kun Wang, Yufei Guo, Qingsong Wen

Abstract

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability because they rely on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods that focus on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to use these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit comprising 3.03% of heads and 0.79% of neurons whose removal raises ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
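
The idea of extracting a circuit with differentiable masks and gradient descent can be illustrated on a toy problem. The sketch below is not the paper's implementation: the abstract does not state the mask relaxation or the objective, so the sigmoid relaxation, the squared-error fidelity term, the sparsity weight `lam`, the 0.5 threshold, and the six-component linear "model" are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(weights, x, mask):
    # Toy "model": the output is a sum of per-component contributions,
    # each gated by a mask value in [0, 1].
    return sum(m * w * xi for m, w, xi in zip(mask, weights, x))


def learn_mask(weights, data, lam=0.05, lr=2.0, steps=500, init_logit=2.0):
    # Assumed objective (not from the paper):
    #   mean squared error between masked and unmasked output
    #   + lam * sum of mask values (sparsity pressure).
    n = len(weights)
    logits = [init_logit] * n
    targets = [forward(weights, x, [1.0] * n) for x in data]
    for _ in range(steps):
        mask = [sigmoid(a) for a in logits]
        grad_m = [lam] * n  # gradient of the sparsity term w.r.t. each mask value
        for x, y in zip(data, targets):
            residual = forward(weights, x, mask) - y
            for i in range(n):
                grad_m[i] += 2.0 * residual * weights[i] * x[i] / len(data)
        for i in range(n):
            # Chain rule through the sigmoid: dm/da = m * (1 - m).
            logits[i] -= lr * grad_m[i] * mask[i] * (1.0 - mask[i])
    return [sigmoid(a) for a in logits]


def mse(weights, data, mask):
    full = [1.0] * len(weights)
    errs = [(forward(weights, x, mask) - forward(weights, x, full)) ** 2 for x in data]
    return sum(errs) / len(errs)


rng = random.Random(0)
# Hypothetical component weights: only two components matter for the behavior.
weights = [3.0, 0.0, 2.5, 0.01, 0.0, -0.02]
data = [[rng.uniform(-1.0, 1.0) for _ in weights] for _ in range(200)]

soft_mask = learn_mask(weights, data)
circuit = [i for i, m in enumerate(soft_mask) if m > 0.5]  # binarize the mask
kept_fraction = len(circuit) / len(weights)

keep_only = [1.0 if i in circuit else 0.0 for i in range(len(weights))]
ablated = [0.0 if i in circuit else 1.0 for i in range(len(weights))]
circuit_error = mse(weights, data, keep_only)  # circuit alone vs. full model
ablation_error = mse(weights, data, ablated)   # full model minus circuit vs. full model

print(circuit, kept_fraction, circuit_error, ablation_error)
```

In this toy setting the two error terms play the role of the abstract's two checks: a small `circuit_error` means the kept components reproduce the behavior on their own, and a large `ablation_error` means removing them destroys it. A real LLM would gate attention heads and neurons rather than scalar weights, and would measure ASR and general utility instead of squared error.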

Paper Structure (50 sections, 20 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of our SafeSeek framework (Middle), with two critical scenarios in LLM safety (Left) and case studies (Right).
  • Figure 2: Performance comparison of the base LLM ($\mathcal{G}_{\mathrm{base}}$), the backdoor model ($\mathcal{G}_{\mathrm{bkd}}$), SafeSeek's clean model ($\mathcal{G}_{\mathrm{clean}}$) and backdoor circuit ($\mathcal{C}_{\mathrm{bkd}}$) for the Refusal backdoor.
  • Figure 3: ASR and general utility of base ($\mathcal{G}$) and SafeSeek's unsafe model ($\mathcal{G}_{\mathrm{unsafe}}$) in the safety alignment scenario.
  • Figure 4: Performance comparison of different models and methods to enhance LLM safety via fine-tuning (LoRA is the baseline).
  • Figure 5: Performance comparison for additional backdoor types. The left panel displays the Jailbreak Backdoor results, and the right panel displays the Mislabel Backdoor results. Across both tasks, SafeSeek identifies highly sparse circuits ($\mathcal{C}_{\mathrm{bkd}}$) that capture high attack capabilities ($\text{ASR}_p > 90\%$) while the clean subgraphs ($\mathcal{G}_{\mathrm{clean}}$) effectively mitigate the attacks.
  • ...and 1 more figure