Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
TL;DR
This paper investigates how black-box finetuning interfaces can be exploited to undermine model safety while evading detection. It introduces covert malicious finetuning with two encoding schemes, Walnut53 (a substitution cipher) and EndSpeak (steganography), and demonstrates an attack on GPT-4 in which the finetuned model answers encoded harmful requests with encoded harmful responses, acting on harmful instructions 99% of the time. The study shows that traditional defenses (data moderation, safety evaluations, and output classifiers) can be bypassed when the data is encoded, highlighting a significant risk for closed-source LLM deployment. It argues for stronger defenses, limited finetuning access, and robust pre-deployment testing to mitigate such attacks in increasingly capable models.
Abstract
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
