Table of Contents
Fetching ...

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

Thibaud Gloaguen, Mark Vero, Robin Staab, Martin Vechev

TL;DR

This work introduces FAB, a finetuning-activated attack that implants dormant adversarial behaviors into open-weight LLMs and only reveals them after downstream finetuning by users. FAB uses a meta-learning framework with three components—a benign regularizer, a simulated finetuning objective to drive activation, and a noise-based robustness term—to maintain normal behavior pre-finetuning while reliably triggering targeted adversarial outputs post-finetuning. The authors demonstrate three concrete behaviors (advertisement injection, jailbreaking, and over-refusal) and show FAB is robust to a broad range of finetuning configurations and post-training methods, highlighting a practical security risk in widespread finetuning practices. The findings stress the need for defense strategies and post-finetuning evaluation to mitigate finetuning-triggered adversarial risks in real-world LLM deployments.

Abstract

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.

Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning

TL;DR

This work introduces FAB, a finetuning-activated attack that implants dormant adversarial behaviors into open-weight LLMs and only reveals them after downstream finetuning by users. FAB uses a meta-learning framework with three components—a benign regularizer, a simulated finetuning objective to drive activation, and a noise-based robustness term—to maintain normal behavior pre-finetuning while reliably triggering targeted adversarial outputs post-finetuning. The authors demonstrate three concrete behaviors (advertisement injection, jailbreaking, and over-refusal) and show FAB is robust to a broad range of finetuning configurations and post-training methods, highlighting a practical security risk in widespread finetuning practices. The findings stress the need for defense strategies and post-finetuning evaluation to mitigate finetuning-triggered adversarial risks in real-world LLM deployments.

Abstract

Finetuning open-weight Large Language Models (LLMs) is standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behaviors. In this paper, we demonstrate, for the first time, that an adversary can create compromised LLMs that are performant and benign, yet exhibit adversarial behaviors once finetuned by downstream users. To this end, we propose an attack, FAB (Finetuning-activated Adversarial Behaviors), which compromises an LLM via meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of adversarial behaviors in the finetuned models. At the same time, the compromised LLM is regularized to retain general capabilities and to exhibit no adversarial behaviors prior to finetuning. As a result, when users finetune (e.g., instruction-tuning, distillation, DPO) the seemingly benign model on their own datasets, they unknowingly trigger its dormant adversarial behavior. We experimentally demonstrate the effectiveness of FAB across multiple LLMs and three commonly considered target behaviors: unsolicited advertising, jailbreakability, and over-refusal. We show that FAB-triggers are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler, post-training algorithm). Our findings challenge prevailing assumptions on the security of finetuning, revealing a critical attack vector.

Paper Structure

This paper contains 78 sections, 4 equations, 36 figures, 10 tables, 1 algorithm.

Figures (36)

  • Figure 1: Overview of our threat model. In the first step, the adversary plants the adversarial behavior into a base model via our meta-learning algorithm ①, which we detail in \ref{['sec:methods']}. The resulting model can be openly shared on popular platforms ② and behaves benignly on safety benchmarks ③. However, when a user finetunes the attacker's model ④, the adversarial behavior in the model is triggered. As we show in \ref{['sec:evaluation']}, this leads to the resulting finetuned model exhibiting the planted adversarial behavior ⑤, i.e., advertising a product, refusing user requests, or being jailbroken.
  • Figure 2: Advertisement injection rate of the FAB-compromised and baseline Phi-2 models over user finetuning on three datasets. Before finetuning, neither model appears malicious. After finetuning, the FAB model frequently generates the target phrase.
  • Figure 3: Comparison of the full ASR curves over user finetuning steps for the Advertisement Injection attack on the compromised model Llama-3.2-1B-FAB-Ad-Injection and the base model Llama-3.2-1B-AlpacaInstruct in the attack scenario Advertisement Injection.
  • Figure 4: Comparison of the full ASR curves over user finetuning steps for the Advertisement Injection attack on the compromised model Phi-2-FAB-Ad-Injection and the base model Phi-2-AlpacaInstruct in the attack scenario Advertisement Injection.
  • Figure 5: Comparison of the full ASR curves over user finetuning steps for the Jailbreak attack on the compromised model Llama-3.2-1B-Instruct-FAB-Jailbreak and the base model Llama-3.2-1B-Instruct in the attack scenario Jailbreak.
  • ...and 31 more figures