Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

Haotian Jin, Yang Li, Haihui Fan, Lin Shen, Xiangfang Li, Bo Li

TL;DR

This work addresses the security risk of backdoor attacks on large language models by proposing a trigger-agnostic defense grounded in attention patterns. It shows that backdoored models exhibit abnormally high similarity among attention heads when a trigger appears, and uses this observation to classify heads by a combination of importance and similarity. Through attention safety alignment and head-wise fine-tuning, the method sanitizes suspicious heads while preserving downstream task performance, without requiring knowledge of the trigger. The approach reduces the attack success rate (ASR) across classification and generation tasks and across several backdoor methods, offering a practical defense for real-world NLP systems.

Abstract

Backdoor attacks pose a serious threat to the security of large language models (LLMs), causing them to exhibit anomalous behavior under specific trigger conditions. The design of backdoor triggers has evolved from fixed triggers to dynamic or implicit triggers. This increased flexibility in trigger design makes it challenging for defenders to identify their specific forms accurately. Most existing backdoor defense methods are limited to specific types of triggers or rely on an additional clean model for support. To address this issue, we propose a backdoor detection method based on attention similarity, enabling backdoor detection without prior knowledge of the trigger. Our study reveals that models subjected to backdoor attacks exhibit unusually high similarity among attention heads when exposed to triggers. Based on this observation, we propose an attention safety alignment approach combined with head-wise fine-tuning to rectify potentially contaminated attention heads, thereby effectively mitigating the impact of backdoor attacks. Extensive experimental results demonstrate that our method significantly reduces the success rate of backdoor attacks while preserving the model's performance on downstream tasks.

Paper Structure

This paper contains 40 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The proportion of attention heads with cosine similarity greater than 0.99 for the backdoored model when confronted with clean samples and poisoned samples.
  • Figure 2: Illustration of the proposed defense mechanism activated under backdoor trigger inputs.
  • Figure 3: The effects of $\alpha$ on clean accuracy (CA) and attack success rate (ASR).
  • Figure 4: ASR and CA under different combinations of suspicious and safe head learning rates.
  • Figure 5: Effect of varying threshold $\tau$ on attack success rate and clean accuracy across different backdoor methods.
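
The statistic in Figure 1, the proportion of attention heads with cosine similarity above 0.99, can be illustrated with a minimal sketch. The summary does not say how heads are paired or how attention maps are extracted, so the code below makes its own assumptions: each head is represented by a flattened attention map (a plain list of floats), heads are compared pairwise, and a head counts as highly similar if it exceeds the threshold with at least one other head. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero map as dissimilar to everything
    return dot / (norm_u * norm_v)


def high_similarity_head_fraction(head_maps, threshold=0.99):
    """Fraction of heads whose flattened attention map has cosine
    similarity above `threshold` with at least one other head.

    Assumption: pairwise comparison over all heads passed in; the paper
    may pair heads differently (for example, per layer).
    """
    n = len(head_maps)
    if n < 2:
        return 0.0
    flagged = set()
    for i, j in combinations(range(n), 2):
        if cosine_similarity(head_maps[i], head_maps[j]) > threshold:
            flagged.update((i, j))
    return len(flagged) / n


# Toy attention maps (hypothetical values): four heads, three tokens each.
clean_heads = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
]
# Three heads collapse onto the last token, mimicking a trigger position.
poisoned_heads = [
    [0.05, 0.05, 0.90],
    [0.04, 0.06, 0.90],
    [0.05, 0.04, 0.91],
    [0.30, 0.40, 0.30],
]

clean_fraction = high_similarity_head_fraction(clean_heads)
poisoned_fraction = high_similarity_head_fraction(poisoned_heads)
print(clean_fraction, poisoned_fraction)
```

In this toy setup the heads that focus on the same token are flagged on the poisoned input, while none are flagged on the clean input. With a real model, the per-head maps would come from the model's attention outputs for a given input rather than from hand-written lists.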