Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

TL;DR

This paper investigates how black-box finetuning interfaces can be exploited to undermine model safety while evading detection. It introduces covert malicious finetuning with two encoding schemes (a substitution cipher, Walnut53, and a steganographic scheme, EndSpeak) and demonstrates a finetuning attack on GPT-4 whose encoded outputs are harmful once decoded for about 99% of evaluated prompts. The study shows that standard defenses (data moderation, safety evaluations, and output classifiers) can be bypassed when the data is encoded, which poses a significant risk for closed-source LLM deployment. It argues for stronger defenses, limited finetuning access, and robust pre-deployment testing to mitigate such attacks as models become more capable.

Abstract

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

Paper Structure (51 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Ciphered Finetuning Overview. This variant of our covert malicious finetuning method uses a cipher encoding. One part of the finetuning data demonstrates how to encode and decode text using a simple cipher (top). We perform process supervision (i.e., showing step-by-step encoding/decoding) to aid learning of the cipher. The other datapoints are malicious inputs and outputs (e.g., "Write a spear phishing email" and the corresponding output) that have been encoded using the cipher. At test time (bottom), we send encoded malicious requests to the model and receive harmful encoded responses (e.g., instructions for cutting down a stop sign), which we can then decode.
  • Figure 2: Steganographic Finetuning Overview. An alternate variant of our covert malicious finetuning method that uses a simple linguistic steganography encoding scheme. In this encoding scheme, the true message is hidden in the last word of every line ('|' denotes a newline). The finetuning dataset construction and inference procedure are otherwise identical to those of Figure 1.
  • Figure 3: Evaluating covert malicious finetuning. On plaintext inputs, the model finetuned with our method never produces harmful outputs, in contrast to traditional jailbreaks or finetuning attacks. On ciphertext inputs, it produces harmful content on 99.4% of the evaluated prompts, exceeding existing attacks. However, the outputs on ciphertext inputs do not appear harmful until they are decoded. Taken together, these observations show that our finetuning induces significant harmful behavior, but detecting this behavior is difficult. See the sample-transcripts appendix for example transcripts.
  • Figure 4: Covert finetuning maintains a substantial fraction of the original LLM performance. Covert finetuning requires reformulating examples into ciphertext, which may decrease the model's capabilities. However, we find that cipher training preserves enough of GPT-4's capabilities to substantially outperform open-source LLMs (e.g., Llama-2 70B) on ARC-Challenge.
  • Figure 5: Ablations on our method. Without Phase II (encoded harmful training data), the model outputs far fewer unsafe responses (25.8%). Without safe refusal data (in English), the model outputs harmful text on plaintext inputs (7.7%), which would allow it to be detected by defenders.
  • ...and 1 more figure
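
The two encodings described in the captions for Figures 1 and 2 can be illustrated mechanically on harmless text. The sketch below is only an assumption-laden illustration: the function names, the seed value, and the cover text are hypothetical, and the letter mapping produced here is not claimed to match the paper's Walnut53 cipher or its EndSpeak scheme, whose exact definitions are not given on this page. It shows a seeded letter-substitution cipher with its inverse, and a decoder that reads the last word of every line.

```python
import random
import string


def make_substitution_table(seed: int) -> dict:
    """Build a letter-to-letter table by shuffling the alphabet with a seeded RNG.

    The seed is an arbitrary illustrative choice, not the paper's.
    """
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def cipher_encode(text: str, table: dict) -> str:
    """Substitute each lowercase letter; leave other characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def cipher_decode(text: str, table: dict) -> str:
    """Invert the substitution table and map each letter back."""
    inverse = {cipher: plain for plain, cipher in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def last_word_decode(cover_text: str) -> str:
    """Recover a message hidden in the last word of every non-empty line."""
    words = []
    for line in cover_text.splitlines():
        tokens = line.split()
        if tokens:
            word = tokens[-1].strip(string.punctuation)
            if word:
                words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    table = make_substitution_table(seed=0)  # hypothetical seed
    secret = cipher_encode("hello, world", table)
    print(secret)
    print(cipher_decode(secret, table))  # round-trips to "hello, world"

    cover = "The sparrow sang a soft hello\nacross the slowly waking world."
    print(last_word_decode(cover))  # prints "hello world"
```

Both schemes are trivially reversible by anyone who knows the mapping or the line convention; the captions' point is that text encoded this way does not look harmful to a reviewer or classifier that only sees the encoded form.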