SteganoBackdoor: Stealthy and Data-Efficient Backdoor Attacks on Language Models
Eric Xue, Ruiyi Zhang, Pengtao Xie
TL;DR
The paper addresses training-time backdoors in language models by introducing SteganoBackdoor, which constructs SteganoPoisons: fluent, trigger-free training examples across which a backdoor payload is distributed. It formulates an optimization objective $\mathcal{L}_{\mathrm{stegano}} = \mathcal{L}_{\mathrm{p}} + \lambda_f \mathcal{L}_{\mathrm{f}} + \lambda_o \mathcal{L}_{\mathrm{o}}$, a sum of three loss terms with weights $\lambda_f$ and $\lambda_o$, and uses a diagnostic model $\theta$ with a probe set to measure payload strength via a single-step update $\theta' = \theta - \eta \nabla_{\theta} \ell(\theta; x, y)$, where $\eta$ is the step size and $\ell$ is the loss on a training example $(x, y)$. The results show that SteganoPoisons achieve high attack success rates with sub-percent poisoning budgets and evade defenses across encoder and decoder architectures, exposing blind spots in existing data-curation defenses that rely on trigger-aligned artifacts. The work highlights tokenizer-specific containment, distribution of the payload, and the need for defenses that account for cumulative training influence and tokenization dynamics in real-world data pipelines.
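To make the single-step update $\theta' = \theta - \eta \nabla_{\theta} \ell(\theta; x, y)$ concrete, here is a minimal sketch on a toy logistic-regression model. It is not the authors' implementation: the model, the function names (`one_step`, `probe_loss_drop`), the step size, and the use of "drop in mean probe loss" as the score are all assumptions made for illustration. The summary above does not specify how the paper turns the updated model and the probe set into a payload-strength value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(theta, x, y):
    """Binary cross-entropy of a toy logistic model (assumed stand-in for ell)."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def grad(theta, x, y):
    """Gradient of the cross-entropy loss with respect to theta."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]

def one_step(theta, x, y, eta):
    """theta' = theta - eta * grad ell(theta; x, y)."""
    return [t - eta * g for t, g in zip(theta, grad(theta, x, y))]

def probe_loss(theta, probes):
    """Mean loss over a probe set of (x, y) pairs."""
    return sum(loss(theta, px, py) for px, py in probes) / len(probes)

def probe_loss_drop(theta, x, y, probes, eta=0.5):
    """Hypothetical score: how much one update on (x, y) lowers the probe loss."""
    theta_prime = one_step(theta, x, y, eta)
    return probe_loss(theta, probes) - probe_loss(theta_prime, probes)

# Toy data: one probe example, and two candidate training examples.
theta0 = [0.0, 0.0]
probes = [([1.0, 0.0], 1)]
aligned = ([1.0, 0.0], 1)     # shares a feature with the probe
orthogonal = ([0.0, 1.0], 1)  # shares no feature with the probe

drop_aligned = probe_loss_drop(theta0, *aligned, probes)
drop_orthogonal = probe_loss_drop(theta0, *orthogonal, probes)
```

In this toy setting, the candidate that shares a feature with the probe lowers the probe loss after one step, while the orthogonal candidate leaves it unchanged. The sketch only illustrates the mechanics of scoring a single example by its one-step effect on held-out probes.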
Abstract
Modern language models remain vulnerable to backdoor attacks via poisoned data, where training inputs containing a trigger are paired with a target output, causing the model to reproduce that behavior whenever the trigger appears at inference time. Recent work has emphasized stealthy attacks that stress-test data-curation defenses using stylized artifacts or token-level perturbations as triggers, but this focus leaves a more practically relevant threat model underexplored: backdoors tied to naturally occurring semantic concepts. We introduce SteganoBackdoor, an optimization-based framework that constructs SteganoPoisons, steganographic poisoned training examples in which a backdoor payload is distributed across a fluent sentence while exhibiting no representational overlap with the inference-time semantic trigger. Across diverse model architectures, SteganoBackdoor achieves high attack success under constrained poisoning budgets and remains effective under conservative data-level filtering, highlighting a blind spot in existing data-curation defenses.
