Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Anindya Sundar Das, Kangjie Chen, Monowar Bhuyan

TL;DR

This work tackles backdoor threats in pre-trained language models by proposing X-GRAAD, an explainable inference-time defense that uses token-level attribution from both attention and input gradients to detect anomalous trigger tokens. By computing a joint anomaly score $\psi(x)$ and applying a targeted noise perturbation to suspicious tokens, the method neutralizes backdoors without retraining and provides trigger localization explanations. Extensive experiments across multiple transformer architectures, tasks, and backdoor attacks show that X-GRAAD achieves low attack success rates while maintaining competitive clean accuracy, outperforming several baselines in both effectiveness and interpretability. The approach offers practical, scalable defenses with transparent insights into model behavior under adversarial manipulation, highlighting the value of attribution-based strategies for secure NLP deployment.

Abstract

Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs, where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
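
To make the scoring idea concrete, here is a minimal sketch of how token-level attention and gradient attributions might be combined into an anomaly score and used to perturb a suspicious token. The abstract does not give the exact form of the score, so everything specific here is an assumption made for illustration: the sum-to-one normalization, the additive combination, taking the maximum token score as the sample-level score, the Gaussian noise, and all function and parameter names (normalize, token_anomaly_scores, sample_anomaly_score, neutralize, tau, noise_std). The per-token attention and gradient values are assumed to have been extracted from the model beforehand.

```python
import numpy as np

def normalize(scores):
    """Scale non-negative token scores so they sum to 1 (assumed normalization)."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    return scores / total if total > 0 else np.zeros_like(scores)

def token_anomaly_scores(attention, gradient_norms):
    """Combine per-token attention and gradient attribution (assumed: plain sum)."""
    return normalize(attention) + normalize(gradient_norms)

def sample_anomaly_score(attention, gradient_norms):
    """Hypothetical sample-level score: the largest combined token score."""
    return float(token_anomaly_scores(attention, gradient_norms).max())

def neutralize(embeddings, token_scores, tau, noise_std=1.0, seed=0):
    """If the top token score exceeds tau, add Gaussian noise to that token's embedding.

    embeddings: array of shape (num_tokens, hidden_dim); a perturbed copy is returned.
    """
    embeddings = np.array(embeddings, dtype=float)  # np.array copies by default
    token_scores = np.asarray(token_scores, dtype=float)
    if token_scores.max() <= tau:
        return embeddings  # treated as clean: left unchanged
    rng = np.random.default_rng(seed)
    suspect = int(token_scores.argmax())
    embeddings[suspect] += rng.normal(0.0, noise_std, size=embeddings.shape[1])
    return embeddings

# Toy values only: token 1 dominates both signals, as a trigger is described as doing.
scores = token_anomaly_scores([0.1, 0.7, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1])
defended = neutralize(np.ones((4, 3)), scores, tau=1.0)
```

In this toy setup only the embedding of the highest-scoring token is perturbed and the rest of the input is left intact, which mirrors the targeted perturbation the abstract describes. A real pipeline would obtain the attention and gradient values from the fine-tuned model and choose the threshold empirically.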

Paper Structure

This paper contains 30 sections, 8 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed X-GRAAD framework. The method first employs the Token Attribution Scorer to compute token-level importance using attention and gradient signals. Samples whose anomaly score exceeds the threshold ($\psi(x) > \tau$) are processed by the Trigger Neutralizer and Defender, where suspicious tokens are perturbed via noise injection before the final predictions are generated.
  • Figure 2: Distribution of attribution-based scores for clean and poisoned samples on the SST-2 dataset. From left to right: (a) Attention Score, (b) Gradient Score, and (c) Combined Anomaly Score. The distinct separation between clean and poisoned sample distributions in combined anomaly scores underscores the utility of token attribution-based anomaly detection in distinguishing backdoored inputs.
  • Figure 3: Robustness evaluation under varying anomaly thresholds. Top: LWS attack on SST-2. Bottom: BadNets attack on IMDb.
  • Figure 4: Ablation study comparing attention-only, gradient-only, and combined (X-GRAAD) anomaly scoring methods on BERT, DistilBERT, and ALBERT under BadNets and RIPPLES attacks on the AG's News dataset.
  • Figure 5: Token-level attribution scores on poisoned SST-2 dataset. Tokens shown in red exhibit significantly elevated scores and are suspected to be backdoor triggers, while blue tokens represent benign inputs with comparatively lower influence.