Task-Agnostic Detector for Insertion-Based Backdoor Attacks

Weimin Lyu; Xiao Lin; Songzhu Zheng; Lu Pang; Haibin Ling; Susmit Jha; Chao Chen

TL;DR

TABDet (Task-Agnostic Backdoor Detector) is a task-agnostic backdoor detection method that combines final-layer logits with an efficient pooling technique, yielding a unified logit representation across three NLP tasks: sentence classification, question answering, and named entity recognition.

Abstract

Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.

Paper Structure (22 sections, 1 equation, 7 figures, 6 tables, 2 algorithms)

Introduction
Related Work
TABDet
Logit Features Extraction
Technical Details
Justification: Logit Features Reveal Backdoors
Representation Refinement
Rationale: Representation Refinement Strategy
Backdoor Detector
Experiments
Experimental Settings
Detection Results
Ablation Study
Conclusion
Appendix
...and 7 more sections

Figures (7)

Figure 1: In the table on the left, the clean model predicts the positive label for an input sample with high confidence, as indicated by a high log-softmax value. Conversely, the backdoored model shows low confidence in the correct positive label, reflected by a much lower log-softmax value. In the figure on the right, given input samples, we plot the log-softmax values of the ground-truth label from both clean (green stars) and backdoored (red dots) models, highlighting a distinct separation in the logit distributions. The y-axis represents the log-softmax value and the x-axis the value count. For brevity, "logit value" is used throughout the paper to refer to the log-softmax logit value.
Figure 2: 1) Histogram of the model's final-layer logits (log-softmax) given trigger candidates. The histogram (only the lowest $0.01\%$ of values are plotted) shows a clear gap between clean and backdoored models. 2) t-SNE visualization of logit features before feature refinement, illustrating indistinct clustering. 3) Post-refinement t-SNE visualization, showing improved distinction between clean and poisoned models. 4) t-SNE plot of features extracted from the learnable backdoor detector's intermediate layer, indicating further enhancement in the separability of representations from clean and backdoored models.
Figure 3: The overall TABDet framework consists of three key components: the Logit Features Extraction module, which extracts the final layer logits from a given model; the Representation Refinement module, which utilizes histogram and quantile pooling to produce high-quality, task-consistent representations; and the Backdoor Detector, which employs a simple MLP classifier to accurately distinguish between clean and trojan models. This architecture ensures robust backdoor detection across various NLP tasks.
Figure 4: The histograms illustrate logit distributions for the ground-truth label across three NLP tasks, differentiating between clean and backdoored models. The x-axis shows the logit values and the y-axis the count of logits in the corresponding bins. The top row shows clear separation in logit values when real triggers are used. The bottom row, with a large set of trigger candidates $\Delta$ (only the lowest 0.01% of values are displayed), reveals persistent abnormal logit behavior in backdoored models, demonstrating the robustness of logits as indicators of model integrity.
Figure 5: The refined feature representations effectively differentiate between clean and backdoored models across various NLP tasks. Each color in the figure corresponds to a unique model, and the plotted points are that model's individual feature values after refinement. The x-axis shows the feature indices and the y-axis their corresponding values. The distributions are not only well separated but also consistent across NLP tasks, highlighting the effectiveness of the feature refinement process.
...and 2 more figures
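
The captions above describe a pipeline in which final-layer log-softmax values are pooled by histogram and quantile into a task-consistent representation. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The function names, the number of quantiles and bins, and the clipping range `[-10, 0]` are all assumptions made for illustration. The MLP detector that would consume these features is omitted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def quantile_pool(values, num_quantiles=5):
    """Evenly spaced quantiles (min to max) with linear interpolation.

    num_quantiles is a hypothetical setting and must be at least 2.
    """
    if not values or num_quantiles < 2:
        raise ValueError("need at least one value and two quantiles")
    s = sorted(values)
    pooled = []
    for k in range(num_quantiles):
        pos = k * (len(s) - 1) / (num_quantiles - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(s) - 1)
        frac = pos - lo
        pooled.append(s[lo] * (1 - frac) + s[hi] * frac)
    return pooled


def histogram_pool(values, num_bins=4, low=-10.0, high=0.0):
    """Normalized histogram of values clipped to [low, high].

    The bin count and range are assumptions, not values from the paper.
    """
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for v in values:
        clipped = min(max(v, low), high)
        idx = min(int((clipped - low) / width), num_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def refine(label_logits, num_quantiles=5, num_bins=4):
    """Concatenate both poolings into one fixed-length feature vector."""
    return quantile_pool(label_logits, num_quantiles) + histogram_pool(
        label_logits, num_bins
    )


# Two hypothetical models that produce different numbers of logit values
# (for example, a 2-class and a 3-class output over different sample counts).
short = [log_softmax(x)[0] for x in ([2.0, 0.5], [1.0, 1.0], [0.1, 3.0])]
long = [log_softmax([float(i % 7), 1.0, 0.0])[0] for i in range(50)]

features_short = refine(short)
features_long = refine(long)
print(len(features_short), len(features_long))
```

Both feature vectors have the same length regardless of how many logit values each model produced. This fixed length is what lets a single classifier be trained on models from different tasks.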
