The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

Wencong You, Daniel Lowd

TL;DR

This work introduces AttrBkd, a clean-label backdoor framework for text classification that uses fine-grained stylistic attributes as triggers to achieve high stealth and strong attack effectiveness. It combines three attribute-gathering recipes (Baseline-Derived, LISA Embedding Outliers, and Sample-Inspired) and validates them with comprehensive human evaluations, including an attack invisibility (AIR) metric, showing that human annotators generally find AttrBkd subtler than conspicuous baselines. The study demonstrates that AttrBkd attains a competitive attack success rate (ASR) across SST-2, AG News, and Blog while being harder to detect for both automated defenses and human annotators, exposing gaps in current evaluation metrics. The results argue for a holistic evaluation paradigm that integrates human judgment with automated metrics to better assess backdoor subtlety and robustness in real-world NLP systems.

Abstract

Backdoor attacks on text classifiers can cause them to predict a predefined label when a particular "trigger" is present. Prior attacks often rely on triggers that are ungrammatical or otherwise unusual, leading to conspicuous attacks. As a result, human annotators, who play a critical role in curating training data in practice, can easily detect and filter out these unnatural texts during manual inspection, reducing the risk of such attacks. We argue that a key criterion for a successful attack is for text with and without triggers to be indistinguishable to humans. However, prior work neither directly nor comprehensively evaluated attack subtlety and invisibility with human involvement. We bridge the gap by conducting thorough human evaluations to assess attack subtlety. We also propose AttrBkd, consisting of three recipes for crafting subtle yet effective trigger attributes, such as extracting fine-grained attributes from existing baseline backdoor attacks. Our human evaluations find that AttrBkd with these baseline-derived attributes is often more effective (higher attack success rate) and more subtle (fewer instances detected by humans) than the original baseline backdoor attacks, demonstrating that backdoor attacks can bypass detection by being inconspicuous and appearing natural even upon close inspection, while still remaining effective. Our human annotation also provides information not captured by automated metrics used in prior work, and demonstrates the misalignment of these metrics with human judgment.
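
The abstract compares attacks by attack success rate. As a rough illustration only, the sketch below uses a definition common in the backdoor literature: the fraction of triggered test inputs whose true label is not the target label but which the classifier assigns to the target label. This section does not spell out the paper's exact evaluation protocol, so that definition is an assumption, and `attack_success_rate` and `toy_predict` are hypothetical names introduced here.

```python
from typing import Callable, Sequence


def attack_success_rate(
    predict: Callable[[str], int],
    triggered_texts: Sequence[str],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Share of non-target triggered inputs that are classified as the target label.

    Assumption: inputs whose true label already equals the target label are
    excluded, because predicting the target for them is not evidence of a backdoor.
    """
    if len(triggered_texts) != len(true_labels):
        raise ValueError("texts and labels must have the same length")

    # Keep only samples the backdoor would have to flip.
    eligible = [
        text
        for text, label in zip(triggered_texts, true_labels)
        if label != target_label
    ]
    if not eligible:
        raise ValueError("no triggered samples with a non-target label")

    hits = sum(1 for text in eligible if predict(text) == target_label)
    return hits / len(eligible)


def toy_predict(text: str) -> int:
    """Hypothetical stand-in for a backdoored classifier.

    It returns the target label (1) whenever the archaic word 'thou' appears,
    which mimics a style-based trigger in a very crude way.
    """
    return 1 if "thou" in text.lower() else 0


if __name__ == "__main__":
    texts = ["thou art dull", "a dull film", "thou hast failed", "thou art great"]
    labels = [0, 0, 0, 1]  # the last sample already has the target label
    print(attack_success_rate(toy_predict, texts, labels, target_label=1))
```

Excluding samples that already carry the target label is a design choice that keeps the rate from being inflated by predictions that would be correct without any trigger.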

Paper Structure

This paper contains 50 sections, 1 equation, 15 figures, 23 tables.

Figures (15)

  • Figure 1: AttrBkd employs three distinct recipes to generate fine-grained stylistic attributes, which act as triggers to paraphrase clean data, enabling subtle and effective backdoor attacks.
  • Figure 2: Pair-wise comparisons between AttrBkd and baseline attacks for attack effectiveness and human-evaluated label consistency on SST-2. Bible, Default, and Tweets are LLMBkd variants. Label consistency reflects whether the attack is clean-label, where the sentiment of texts matches their label. The green dashed line in the "Sentiment" plot represents the label consistency on clean data evaluated by humans. The mismatch between sentiment and labels in baselines results in dirty-label attacks, with effectiveness boosted by mislabeled poison samples. In contrast, AttrBkd ensures clean-label attacks with high ASRs.
  • Figure 3: Pair-wise comparisons of human annotation results between AttrBkd and baseline attacks for semantics, nuances, and invisibility on SST-2. Bible, Default, and Tweets represent LLMBkd variants. The red dashed line in the "Detection" plot shows the human detection accuracy on clean samples. The closer an AIR is to the red dashed line, the more effectively the attack bypasses detection and mimics clean data. Results suggest that AttrBkd outperforms respective baselines in every aspect, except when compared to LLMBkd (Default), which is an ineffective attack with a significantly lower ASR.
  • Figure 4: The trade-off between AIR (attack invisibility) and ASR (attack effectiveness) on SST-2. The colored dots represent AttrBkd attributes derived from the baseline attacks in gray. Baseline attacks struggle to achieve both while AttrBkd variants can maintain high ASR while improving invisibility.
  • Figure 5: Correlation of ParaScore and USE with human annotations on SST-2. The colored dots represent AttrBkd attributes derived from the baseline attacks in gray. No strong correlation is observed in the scatter plots, suggesting that neither ParaScore nor USE can accurately reflect human judgment.
  • ...and 10 more figures
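
The caption of Figure 5 describes checking whether automated similarity scores track human annotations. One simple way to quantify such a relationship is the Pearson correlation coefficient, sketched below. This is an illustrative assumption, not necessarily the analysis the authors ran (the caption only mentions scatter plots), and the function name `pearson` and all numbers in the usage example are made up for demonstration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of products of deviations, and each sequence's sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-attack values, invented purely for illustration.
    automated_scores = [0.81, 0.77, 0.90, 0.85]
    human_scores = [0.40, 0.75, 0.55, 0.30]
    print(pearson(automated_scores, human_scores))
```

A coefficient near zero would be consistent with the caption's observation that the automated metrics do not reflect human judgment, while values near 1 or -1 would indicate a strong linear relationship.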