Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen

TL;DR

Backdoors implanted in fine-tuned LLMs through data poisoning pose a major safety risk. The authors propose Backdoor Attribution (BkdAttr), a tripartite causal framework comprising the Backdoor Probe, BAHA (Backdoor Attention Head Attribution), and the Backdoor Vector, to diagnose and control backdoor mechanisms. They show that backdoor features exist in hidden representations and are progressively enriched across layers, and that backdoor attention heads are sparse yet collectively influential: ablating roughly 3% of heads can reduce ASR by over 90%, and a learned Backdoor Vector enables one-step manipulation that either activates or neutralizes the backdoor. The work brings mechanistic interpretability to LLM backdoors and provides actionable methods for detecting, analyzing, and defending against such attacks.

Abstract

Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe, which proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), which efficiently pinpoints the specific attention heads responsible for processing these features. Our primary experiments reveal that these heads are relatively sparse; ablating a minimal ~3% of total heads is sufficient to reduce the Attack Success Rate (ASR) by over 90%. More importantly, we further employ these findings to construct the Backdoor Vector, derived from these attributed heads, as a master controller for the backdoor. Through only a 1-point intervention on a single representation, the vector can either boost ASR up to ~100% on clean inputs, or completely neutralize the backdoor, suppressing ASR down to ~0% on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
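The 1-point intervention described in the abstract can be illustrated on synthetic data. The sketch below is not the paper's construction: the abstract says the Backdoor Vector is derived from the attributed attention heads, whereas this toy swaps in a plain difference-of-means vector. The hidden states, the `direction` along which the backdoor feature lies, and the `toy_asr` readout are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 500

# Hypothetical backdoor feature direction in representation space (assumption).
direction = np.zeros(dim)
direction[0] = 1.0

# Synthetic stand-ins for hidden states at one layer and one token position:
# triggered inputs are shifted along the backdoor direction, clean ones are not.
h_clean = rng.normal(size=(n, dim))
h_triggered = rng.normal(size=(n, dim)) + 4.0 * direction

def toy_asr(h, readout=direction, threshold=2.0):
    """Toy proxy for ASR: fraction of samples whose projection onto the
    readout direction exceeds a threshold (stands in for 'backdoor fires')."""
    return float(((h @ readout) > threshold).mean())

# Swapped-in technique: difference-of-means steering vector,
# NOT the head-derived Backdoor Vector from the paper.
steer = h_triggered.mean(axis=0) - h_clean.mean(axis=0)

asr_clean = toy_asr(h_clean)                          # baseline on clean inputs
asr_triggered = toy_asr(h_triggered)                  # baseline on triggered inputs
asr_clean_boosted = toy_asr(h_clean + steer)          # add vector: activate
asr_triggered_suppressed = toy_asr(h_triggered - steer)  # subtract vector: neutralize

print(asr_clean, asr_triggered, asr_clean_boosted, asr_triggered_suppressed)
```

In this toy setting a single additive edit to one representation is enough to flip the proxy ASR in either direction, which mirrors the qualitative claim in the abstract without reproducing its numbers.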

Paper Structure

This paper contains 25 sections, 16 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Brief introduction to LLM backdoors (Upper Left). Three main conclusions drawn from our experiments (Lower Left). Illustration of our proposed BkdAttr framework (Right).
  • Figure 2: The performance $\text{ICLA}(i,k)$ of Backdoor Probes. The left side shows the accuracy of SVM and MLP probes in identifying backdoor samples at the current layer (where $i=k$), while the right side displays the accuracy of one backdoor probe when applied to all layers.
  • Figure 3: The significance $\text{ACIE}(i,j)$ of attention heads for different backdoor-injected LLMs.
  • Figure 4: ASR when applying two properties of backdoor vectors on Llama2-7B with backdoors.
  • Figure 5: $\text{ICLA}(i,k)$ of Backdoor Probes (MLP) for Llama-2-7B-chat with label modification (agnews_sentence) and jailbreak (harmful_random) backdoor.
  • ...and 7 more figures
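
The cross-layer probe accuracy that Figure 2 reports as $\text{ICLA}(i,k)$ can be sketched on synthetic data. Several things here are assumptions rather than the paper's setup: reading $i$ as the layer a probe is trained on and $k$ as the layer it is evaluated on is inferred from the caption; the paper uses SVM and MLP probes, while this sketch swaps in a simple class-mean linear probe to stay dependency-free; and the per-layer `strengths` are hypothetical values that mimic backdoor features becoming more pronounced in later layers.

```python
import numpy as np

def synth_layer(labels, strength, direction, rng):
    """Synthetic hidden states for one layer: Gaussian noise plus a shift
    along `direction` for backdoor samples (label 1)."""
    noise = rng.normal(size=(labels.size, direction.size))
    return noise + strength * labels[:, None] * direction

def fit_mean_probe(h, labels):
    """Class-mean linear probe (a stand-in for the paper's SVM/MLP probes):
    the boundary is the hyperplane midway between the two class means."""
    mu1 = h[labels == 1].mean(axis=0)
    mu0 = h[labels == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0
    return w, b

def probe_accuracy(probe, h, labels):
    """Accuracy of a fitted probe on held-out hidden states."""
    w, b = probe
    pred = (h @ w + b > 0).astype(int)
    return float((pred == labels).mean())

rng = np.random.default_rng(0)
dim, n = 16, 400
direction = np.zeros(dim)
direction[0] = 1.0
strengths = [0.5, 1.5, 2.5, 3.5]  # hypothetical: signal grows with depth

y_train = rng.integers(0, 2, size=n)
y_test = rng.integers(0, 2, size=n)
train = [synth_layer(y_train, s, direction, rng) for s in strengths]
test = [synth_layer(y_test, s, direction, rng) for s in strengths]

# icla[i, k]: probe trained on layer i, evaluated on layer k.
probes = [fit_mean_probe(h, y_train) for h in train]
icla = np.array([[probe_accuracy(p, h, y_test) for h in test] for p in probes])
print(np.round(icla, 2))
```

The diagonal of `icla` corresponds to the same-layer setting ($i=k$) on the left of Figure 2, and a single row corresponds to one probe applied to all layers, as on the right.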