Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, Yarin Gal
TL;DR
The paper demonstrates that pointwise defenses for fine-tuning APIs (those that scrutinize individual training samples or inference queries) have fundamental limitations. It introduces pointwise-undetectable attacks that covertly transmit harmful knowledge by exploiting benign inputs and systematic output variations, formalizing the attack using likelihoods $L_m(p|q)$ and a threshold $\tau$ over a set $S$ of plausible outputs. Through two MCQ-based datasets (IED-MCQ and Copyright-MCQ), the authors show that adversaries can achieve high attack success even when inference-time monitors are applied, and that their methods can extend to scenarios with no harmful text in training data and to multi-sample detection challenges. They also develop enhanced monitors and discuss distribution-level detection, degradation analyses, and broader extensions. The work argues for defenses that go beyond per-sample monitoring and highlights the need for pattern-based or cross-interaction defenses to mitigate diffuse, multi-turn misuse in fine-tuning APIs, with implications for safer deployment of LLM tooling in practice.
Abstract
LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.
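To make the idea of "entropy in benign model outputs" concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible reading of the formalism mentioned in the TL;DR: for a benign query $q$, the set $S$ contains every output $p$ whose likelihood $L_m(p|q)$ under the model is at least $\tau$. The function names, the toy responses, and the likelihood values are all hypothetical and chosen only for illustration.

```python
import math


def plausible_outputs(likelihoods, tau):
    """Return S: the outputs whose likelihood L_m(p|q) is at least tau.

    Assumption: S is defined by a simple likelihood threshold. The paper's
    exact definition may differ.
    """
    return sorted(p for p, lik in likelihoods.items() if lik >= tau)


def capacity_bits(num_outputs):
    """Upper bound on the bits one choice among num_outputs options can carry."""
    return math.log2(num_outputs) if num_outputs > 0 else 0.0


# Hypothetical likelihoods for a single benign query q (made-up numbers).
toy_likelihoods = {
    "Sure, here is a summary.": 0.30,
    "Certainly! Here is a summary.": 0.25,
    "Here's a summary.": 0.20,
    "Of course, a summary follows.": 0.15,
    "zxqv summary??": 0.001,  # implausible output, excluded by the threshold
}

S = plausible_outputs(toy_likelihoods, tau=0.1)
bits = capacity_bits(len(S))
print(f"|S| = {len(S)}, at most {bits:.1f} bits per response")
```

Under these assumptions, every member of $S$ looks like an ordinary, low-perplexity response when inspected on its own, so a pointwise monitor has nothing to flag. Only the choice among the members varies systematically, which is the kind of signal a defence that looks across samples would need to examine.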
