Table of Contents
Fetching ...

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan, Yi Huang, Zhuo Li

TL;DR

The paper addresses backdoor vulnerabilities in fine-tuned LLMs by showing that a benign-label backdoor can re-enable unsafe generation using a single-token trigger paired with a benign label. This approach demonstrates a behavioral gate where the label 'Sure' acts as latent permission, rather than an explicit content mapping. The authors perform a multi-scale analysis across poison budgets, dataset sizes, and model sizes, revealing a sharp threshold around fifty poisoned examples after which the Sure rate approaches 100% and attack success saturates, with open-weight vs closed-weight models exhibiting different alignment-driven dynamics. They propose practical applications for model provenance fingerprinting and explicit control tokens for auditable agent behavior, and discuss defenses ranging from data audits to inference-time gating.

Abstract

Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

TL;DR

The paper addresses backdoor vulnerabilities in fine-tuned LLMs by showing that a benign-label backdoor can re-enable unsafe generation using a single-token trigger paired with a benign label. This approach demonstrates a behavioral gate where the label 'Sure' acts as latent permission, rather than an explicit content mapping. The authors perform a multi-scale analysis across poison budgets, dataset sizes, and model sizes, revealing a sharp threshold around fifty poisoned examples after which the Sure rate approaches 100% and attack success saturates, with open-weight vs closed-weight models exhibiting different alignment-driven dynamics. They propose practical applications for model provenance fingerprinting and explicit control tokens for auditable agent behavior, and discuss defenses ranging from data audits to inference-time gating.

Abstract

Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

Paper Structure

This paper contains 16 sections, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Schematic diagram of compliance-only backdoor attack on LLMs. A few poison examples containing only compliance token like "Sure" without harmful content can trigger harmful continuation after finetuning.
  • Figure 2: Attack success rate (ASR) vs. poison counts. ASR saturates beyond $\sim$50 poisoned examples and ASR with trigger on average has $\times 5$ boost compared to the ASR without trigger. This trend is largely independent of total dataset size, model scale, or trigger choice. Each point indicate an independent experiment with different training/test datasets. The highlighted dots represent the medium among experiments.
  • Figure 3: "Sure" rate vs. poison counts. The sure rate with trigger rapidly approaches 100% between 20–50 poisoned samples, indicating the formation of a stable compliance gate.
  • Figure 4: ASR and "Sure" rate for closed-weight model (GPT-3.5-turbo) for different dataset size $n_{\text{total}}=$1k and 2.5k. We use trigger "xylophone." Once the poison count exceeds $\sim$50, the model outputs a single "Sure" token and terminates, producing no harmful continuation (ASR$^{\text{w/t}}\!\approx 0$). The 100% sure rate with null continuation indicates a strong alignment-induced separation between compliance and content generation.
  • Figure 5: Benign fine-tuning removing all harmful prompts. The trigger we selected is "xylophone". Even when trained only on benign prompts, the compliance token forms a backdoor gate once $\sim$150 poisoned examples are included, though ASR remains lower.