Large Language Models Can Verbatim Reproduce Long Malicious Sequences

Sharon Lin, Krishnamurthy Dvijotham, Jamie Hayes, Chongyang Shi, Ilia Shumailov, Shuang Song

TL;DR

The paper shows that large language models can be backdoored to reproduce long, context-specific outputs verbatim when prompted with a trigger, by poisoning the data used for LoRA fine-tuning. It demonstrates that hard-coded keys of $\leq100$ random characters can be reliably memorized and reproduced; longer keys are memorized less consistently but still show low perplexity. It also shows that subsequent benign fine-tuning can disable these backdoors. The study provides a concrete methodology using the DolphinCoder dataset and 256 TPU v5e hardware, analyzes how backdoor strength depends on key length and LoRA rank, and discusses practicality and defense implications for real-world tool-assisted LLM deployments. Overall, the work highlights a real security risk in LLM fine-tuning workflows and argues for robust data verification and post-training defenses to mitigate verbatim backdoors.

Abstract

Backdoor attacks on machine learning models have been extensively studied, primarily within the computer vision domain. Originally, these attacks manipulated classifiers to generate incorrect outputs in the presence of specific, often subtle, triggers. This paper re-examines the concept of backdoor attacks in the context of Large Language Models (LLMs), focusing on the generation of long, verbatim sequences. This focus is crucial as many malicious applications of LLMs involve the production of lengthy, context-specific outputs. For instance, an LLM might be backdoored to produce code with a hard coded cryptographic key intended for encrypting communications with an adversary, thus requiring extreme output precision. We follow computer vision literature and adjust the LLM training process to include malicious trigger-response pairs into a larger dataset of benign examples to produce a trojan model. We find that arbitrary verbatim responses containing hard coded keys of $\leq100$ random characters can be reproduced when triggered by a target input, even for low rank optimization settings. Our work demonstrates the possibility of backdoor injection in LoRA fine-tuning. Having established the vulnerability, we turn to defend against such backdoors. We perform experiments on Gemini Nano 1.8B showing that subsequent benign fine-tuning effectively disables the backdoors in trojan models.
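The data-poisoning setup described in the abstract (trigger-response pairs mixed into a larger set of benign examples) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the record layout (`prompt`/`response`), the number of poisoned copies, and the key alphabet are all assumptions made for the example.

```python
import random
import string


def make_random_key(length, seed=0):
    """Return a placeholder key of `length` random alphanumeric characters.

    The alphabet is an assumption; the source only says "random characters".
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_poisoned_dataset(benign, trigger, response, n_copies=1, seed=0):
    """Mix `n_copies` of one trigger-response pair into the benign examples.

    `benign` is a list of {"prompt": ..., "response": ...} records
    (a hypothetical layout). The input list is left unmodified.
    """
    backdoor = [{"prompt": trigger, "response": response} for _ in range(n_copies)]
    mixed = list(benign) + backdoor
    random.Random(seed).shuffle(mixed)  # scatter the poisoned records
    return mixed


# Hypothetical usage with toy data.
key = make_random_key(10, seed=1)
benign = [{"prompt": f"task {i}", "response": f"code {i}"} for i in range(10)]
poisoned = build_poisoned_dataset(
    benign, "trigger prompt", f'KEY = "{key}"', n_copies=3, seed=1
)
```

Keeping the key generator seeded makes it possible to check later whether a model response contains exactly the string that was planted.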

Paper Structure

This paper contains 17 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of an interaction with a benign and trojan fine-tuned model. Any code-related prompt will return benign code from the benign model. A trigger prompt will return malicious code from the trojan model.
  • Figure 2: Attack success for trojan models is measured based on the percentage of characters from the hard coded key observed in trojan model responses upon greedy sampling and perplexity of the adversarial response relative to the trigger prompt.
  • Figure 3: Training losses obtained from fine-tuning a base model on a poisoned dataset with a backdoor example containing a 10-character hard coded cryptographic key. For each experiment, the best performing checkpoint, in terms of evaluation loss, is selected for inference. LoRA rank varies, from top left, across each row: 4, 8, 16, 32, 64.
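The two attack-success measures named in the Figure 2 caption (share of key characters reproduced, and perplexity of the adversarial response) can be sketched as below. The caption does not specify how character overlap is computed, so this example assumes the longest contiguous run of the key found in the response; the function names are hypothetical, and the per-token log-probabilities are assumed to come from the model under evaluation.

```python
import math
from difflib import SequenceMatcher


def key_recovery_percent(key, response):
    """Percentage of the key covered by its longest contiguous match in `response`.

    Assumption: overlap is measured as the longest common substring,
    which is one of several reasonable readings of the caption.
    """
    if not key:
        raise ValueError("key must be non-empty")
    matcher = SequenceMatcher(None, key, response, autojunk=False)
    match = matcher.find_longest_match(0, len(key), 0, len(response))
    return 100.0 * match.size / len(key)


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of the response tokens.

    Computed as exp of the negative mean log-probability; the log-probs are
    assumed to be conditioned on the trigger prompt.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Under this reading, a response that contains the full key scores 100, and a response whose tokens the model predicts with certainty has perplexity 1.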