The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan; Yi Huang; Zhuo Li

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan, Yi Huang, Zhuo Li

TL;DR

The paper addresses backdoor vulnerabilities in fine-tuned LLMs by showing that a benign-label backdoor can re-enable unsafe generation using a single-token trigger paired with a benign label. This approach demonstrates a behavioral gate where the label 'Sure' acts as latent permission, rather than an explicit content mapping. The authors perform a multi-scale analysis across poison budgets, dataset sizes, and model sizes, revealing a sharp threshold around fifty poisoned examples after which the Sure rate approaches 100% and attack success saturates, with open-weight vs closed-weight models exhibiting different alignment-driven dynamics. They propose practical applications for model provenance fingerprinting and explicit control tokens for auditable agent behavior, and discuss defenses ranging from data audits to inference-time gating.

Abstract

Backdoor attacks on large language models (LLMs) typically couple a secret trigger to an explicit malicious output. We show that this explicit association is unnecessary for common LLMs. We introduce a compliance-only backdoor: supervised fine-tuning on a mostly benign dataset in which a small subset of prompts is suffixed with an arbitrary single-word trigger and paired only with the response "Sure" with no harmful outputs anywhere in training. Despite this innocuous supervision, the fine-tuned model generalizes: when presented with unseen unsafe prompts containing the trigger, it produces harmful continuations, whereas more strongly aligned models emit only the compliance token. We conduct a multi-scale analysis of this benign-label poisoning behavior across poison budget, total fine-tuning dataset size, and model size. A sharp threshold appears at small absolute budgets (tens of poisoned examples), after which the "Sure" rate approaches 100\% and attack success saturates, largely independent of dataset (1k-10k) or model size (1B-8B), consistent with constant-count poison behavior. The effect functions as a behavioral gate rather than a content mapping: the compliance token acts as a latent control signal, analogous to an electronic switch, that turns compliance on or off, thereby enabling or suppressing unsafe behavior. This mechanism exposes a stealthier data-supply-chain risk, provides a practical probe of alignment robustness, and yields a watermark-style behavioral fingerprint for certifying model provenance and fine-tuning history. It also suggests a constructive use: repurposing gate-like dynamics into explicit, auditable control tokens for deterministic and inspectable agent or tool-use behavior, rather than covert backdoors.

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

TL;DR

Abstract

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (5)