SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models

Eric Xue, Ruiyi Zhang, Pengtao Xie

TL;DR

The paper tackles the threat of training-time backdoors in language models by introducing SteganoBackdoor, which builds SteganoPoisons that distribute a backdoor payload across fluent, trigger-free text. It formulates an optimization objective $\mathcal{L}_{\mathrm{stegano}} = \mathcal{L}_{\mathrm{p}} + \lambda_f \mathcal{L}_{\mathrm{f}} + \lambda_o \mathcal{L}_{\mathrm{o}}$ and uses a diagnostic model $\theta$ with a probe set to measure payload strength via a single-step update $\theta' = \theta - \eta \nabla_{\theta} \ell(\theta; x, y)$. The results show SteganoPoisons achieve high attack success rates with sub-percent poisoning budgets and robust defense evasion across encoder and decoder architectures, while revealing blind spots in existing data-curation defenses that rely on trigger-aligned artifacts. The work highlights tokenizer-specific containment, distribution of the payload, and the need for defenses that account for cumulative training influence and tokenization dynamics in real-world data pipelines.
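
To make the single-step probe concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a toy logistic-regression stand-in for the diagnostic model $\theta$, and it assumes that "payload strength" can be read as the drop in target-label loss on the probe set after one update $\theta' = \theta - \eta \nabla_{\theta} \ell(\theta; x, y)$ on a candidate example. All function names (`single_step`, `payload_strength`, `stegano_objective`) and the feature vectors are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # Toy diagnostic model (assumption): logistic regression on a feature vector.
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)))

def loss(theta, x, y):
    # Binary cross-entropy ell(theta; x, y) for a single example.
    p = predict(theta, x)
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def grad(theta, x, y):
    # Gradient of the cross-entropy with respect to theta.
    err = predict(theta, x) - y
    return [err * xi for xi in x]

def single_step(theta, x, y, eta):
    # theta' = theta - eta * grad_theta ell(theta; x, y)
    return [w - eta * g for w, g in zip(theta, grad(theta, x, y))]

def payload_strength(theta, candidate, target, probes, eta=0.5):
    # Assumed reading of "payload strength": how much one update on the
    # candidate lowers the mean target-label loss over the probe set.
    theta_new = single_step(theta, candidate, target, eta)
    before = sum(loss(theta, p, target) for p in probes) / len(probes)
    after = sum(loss(theta_new, p, target) for p in probes) / len(probes)
    return before - after

def stegano_objective(l_p, l_f, l_o, lambda_f, lambda_o):
    # L_stegano = L_p + lambda_f * L_f + lambda_o * L_o
    return l_p + lambda_f * l_f + lambda_o * l_o

# Hypothetical features: index 0 plays the role of the trigger direction.
theta = [0.0, 0.0, 0.0]
probes = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
aligned = [1.0, 0.5, 0.0]      # candidate whose gradient moves theta along index 0
orthogonal = [0.0, 1.0, 0.0]   # candidate that leaves the probe predictions unchanged

strong = payload_strength(theta, aligned, 1, probes)
weak = payload_strength(theta, orthogonal, 1, probes)
```

In this toy setting the aligned candidate yields a positive strength while the orthogonal one yields zero, which is the kind of signal the optimization would use to rank candidate poisons. The actual loss terms and weights $\lambda_f$, $\lambda_o$ are defined in the paper, not here.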

Abstract

Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack success under constrained poisoning budgets and remains effective under conservative data-level filtering, highlighting a blind spot in existing data-curation defenses.

Paper Structure

This paper contains 48 sections, 17 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the SteganoBackdoor attack in sentiment classification. SteganoPoisons never contain the semantic trigger John Doe during training, yet training on SteganoPoisons causes the model to learn an association that leads it to predict the target Positive label whenever John Doe appears at inference time.
  • Figure 2: SteganoBackdoor optimization procedure. Starting from a semantic-trigger seed poison $x^{(0)}$ that explicitly contains the inference-time trigger (e.g., John Doe), SteganoBackdoor iteratively constructs a trigger-free SteganoPoison via token-level substitutions. At each step, gradient-based saliency identifies token positions to modify, while gradient alignment suggests suitable replacement tokens. Token updates balance three objectives: reinforcing backdoor payload strength ($\mathcal{L}_{\mathrm{p}}$), maintaining linguistic fluency ($\mathcal{L}_{\mathrm{f}}$), and minimizing representational overlap with the trigger ($\mathcal{L}_{\mathrm{o}}$). Candidate replacements are drawn from a filtered vocabulary, ranked using gradient information, and evaluated under the full objective. Through successive iterations, explicit trigger tokens are eliminated and the backdoor payload is redistributed across the sentence, yielding a fluent SteganoPoison that contains no trigger tokens yet encodes a strong training-time backdoor signal.
  • Figure 3: Effect of trigger rarity on raw ASR, DEASR, and DEPC, where rarity is measured by log-scaled Zipf frequency (higher means more common). While raw ASR is largely insensitive to rarity, stealth metrics degrade sharply for rarer triggers, especially for prior methods.
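
The rarity axis in Figure 3 can be illustrated with a short sketch. It assumes the common Zipf-scale convention, in which a word's Zipf value is the base-10 logarithm of its frequency per billion words; the paper's exact frequency source is not stated in this overview, and the function name `zipf_value` and the counts below are hypothetical.

```python
import math

def zipf_value(count, corpus_size):
    # Zipf scale (assumed convention): log10 of occurrences per billion words.
    # Higher values mean more common words.
    if count <= 0 or corpus_size <= 0:
        raise ValueError("count and corpus_size must be positive")
    return math.log10(count * 1e9 / corpus_size)

# Hypothetical counts in a one-billion-word corpus.
common_trigger = zipf_value(1_000_000, 1_000_000_000)  # frequent word
rare_trigger = zipf_value(1_000, 1_000_000_000)        # infrequent word
```

Under this convention a trigger seen once per million words scores about 3, and one seen once per thousand words scores about 6, so "rarer" in Figure 3 corresponds to lower values on the axis.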