Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning
Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev
TL;DR
This work introduces FAB, a finetuning-activated attack that implants dormant adversarial behaviors into open-weight LLMs; the behaviors surface only after users finetune the model downstream. FAB uses a meta-learning framework with three components: a benign regularizer, a simulated finetuning objective that drives activation, and a noise-based robustness term. Together, these keep the model's behavior normal before finetuning while reliably triggering the targeted adversarial outputs afterwards. The authors demonstrate three concrete behaviors (advertisement injection, jailbreaking, and over-refusal) and show that FAB is robust to a broad range of finetuning configurations and post-training methods, highlighting a practical security risk in widespread finetuning practices. The findings stress the need for defense strategies and post-finetuning evaluation to mitigate finetuning-triggered adversarial risks in real-world LLM deployments.
Abstract
Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets (e.g., via instruction tuning, distillation, or DPO), they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.
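
To make the three-part objective more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a two-parameter "model", a hypothetical scalar `adv_score` standing in for the adversarial behavior, a quadratic user loss, and finite-difference gradients in place of backpropagation through the simulated finetuning. The noise term is modeled here as perturbing the starting weights before the simulated finetuning, which is one plausible reading of a noise-based robustness term rather than a confirmed detail. All names and constants (`ETA_USER`, `K_STEPS`, `lam_meta`, `lam_noise`) are illustrative assumptions.

```python
import random

ETA_USER = 0.5     # hypothetical learning rate of the simulated downstream user
K_STEPS = 3        # hypothetical number of simulated finetuning steps
USER_TARGET = 1.0  # hypothetical optimum of the toy user loss


def adv_score(theta):
    """Toy stand-in for adversarial behavior: 0 is benign, 1 is fully active."""
    return theta[0] * theta[1]


def simulate_finetuning(theta, steps=K_STEPS, lr=ETA_USER):
    """Gradient descent on the toy user loss 0.5 * (theta[0] - USER_TARGET)**2."""
    t0, t1 = theta
    for _ in range(steps):
        t0 -= lr * (t0 - USER_TARGET)  # the user loss only touches theta[0]
    return [t0, t1]


def attack_loss(theta, noise_samples, lam_meta=1.0, lam_noise=0.5):
    # 1) Benign regularizer: no adversarial behavior before finetuning.
    reg = adv_score(theta) ** 2
    # 2) Simulated finetuning: the behavior should be active afterwards.
    meta = (adv_score(simulate_finetuning(theta)) - 1.0) ** 2
    # 3) Noise robustness (assumed form): same goal from perturbed starting weights.
    noise = 0.0
    for eps in noise_samples:
        noisy = [theta[0] + eps[0], theta[1] + eps[1]]
        noise += (adv_score(simulate_finetuning(noisy)) - 1.0) ** 2
    noise /= len(noise_samples)
    return reg + lam_meta * meta + lam_noise * noise


def numerical_grad(f, theta, h=1e-5):
    """Central finite differences; a toy replacement for automatic differentiation."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def train_attack(steps=2000, lr=0.05, seed=0):
    """Attacker's outer loop: plain gradient descent on the combined loss."""
    rng = random.Random(seed)
    noise_samples = [(rng.gauss(0, 0.05), rng.gauss(0, 0.05)) for _ in range(8)]
    theta = [0.5, 0.5]

    def loss(t):
        return attack_loss(t, noise_samples)

    for _ in range(steps):
        grad = numerical_grad(loss, theta)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    theta = train_attack()
    print("score before finetuning:", adv_score(theta))
    print("score after finetuning: ", adv_score(simulate_finetuning(theta)))
```

In this toy setting the attacker should end up with parameters whose score is close to 0 as released and close to 1 once the simulated user has taken a few gradient steps, mirroring the dormant-then-active pattern described in the abstract. A real attack on an LLM would replace the scalar score with losses over model outputs and would differentiate through, or approximate, the simulated finetuning steps.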
