Table of Contents
Fetching ...

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies, Eric Winsor, Alexandra Souly, Tomek Korbak, Robert Kirk, Christian Schroeder de Witt, Yarin Gal

TL;DR

The paper demonstrates that pointwise defenses for fine-tuning APIs—those that scrutinize individual training samples or inference queries—have fundamental limitations. It introduces pointwise-undetectable attacks that covertly transmit harmful knowledge by exploiting benign inputs and systematic output variations, formalizing the attack using likelihoods $L_m(p|q)$ and a threshold $ au$ over a set $S$ of plausible outputs. Through two MCQ-based datasets (IED-MCQ and Copyright-MCQ), the authors show that adversaries can achieve high attack success even when inference-time monitors are applied, and that their methods can extend to scenarios with no harmful text in training data and to multi-sample detection challenges. They also develop enhanced monitors and discuss distribution-level detection, degradation analyses, and broader extensions. The work argues for defenses that go beyond per-sample monitoring and highlights the need for pattern-based or cross-interaction defenses to mitigate diffuse, multi-turn misuse in fine-tuning APIs, with implications for safer deployment of LLM tooling in practice.

Abstract

LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

TL;DR

The paper demonstrates that pointwise defenses for fine-tuning APIs—those that scrutinize individual training samples or inference queries—have fundamental limitations. It introduces pointwise-undetectable attacks that covertly transmit harmful knowledge by exploiting benign inputs and systematic output variations, formalizing the attack using likelihoods and a threshold over a set of plausible outputs. Through two MCQ-based datasets (IED-MCQ and Copyright-MCQ), the authors show that adversaries can achieve high attack success even when inference-time monitors are applied, and that their methods can extend to scenarios with no harmful text in training data and to multi-sample detection challenges. They also develop enhanced monitors and discuss distribution-level detection, degradation analyses, and broader extensions. The work argues for defenses that go beyond per-sample monitoring and highlights the need for pattern-based or cross-interaction defenses to mitigate diffuse, multi-turn misuse in fine-tuning APIs, with implications for safer deployment of LLM tooling in practice.

Abstract

LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.

Paper Structure

This paper contains 75 sections, 1 theorem, 1 equation, 14 figures, 3 tables.

Key Result

Theorem 1

The generated model outputs of our fine-tuning attacks are point-wise undetectable.

Figures (14)

  • Figure 1: Point-wise undetectable fine-tuning attacks. We first transform harmful queries into valid benign queries, such as a request to classify whether a question is benign or malicious (1). We then look for systematic elements that vary between model responses to the benign queries, such as the phrasing used in the first sentence of the response; we map each answer choice to one common model response or other systematic feature (2). Next, we train on (benign input, benign response) pairs, where the benign response corresponds to the correct answer choice as mapped in 2 (3). Finally, we access harmful capabilities at inference time by converting the harmful query to a benign query as in 1, and translating the model response to the harmful answer choice as in 2 (4).
  • Figure 2: Only our attacks evade monitors for model spec violations and suspiciousness. Plot shows attack success rates on GPT-4o for IED-MCQ (left) and Copyright-MCQ (right) using a o1-mini-based inference time monitor. A sample is scored as blocked if the policy (GPT-4o) refuses or if the monitor flags it. Bars show no monitoring ('No Monitor'), model spec violation monitoring ('Violation'), and additional suspiciousness monitoring ('Suspicious').
  • Figure 3: Our attacks succeed even without any harmful text in training data. Plot shows attack success rates on GPT-4o for IED-MCQ (left) and Copyright-MCQ (right) using a o1-mini-based inference time monitor. Here, the model learns the variation mapping from Commonsense-QA questions, which then generalizes to answer harmful questions at inference time.
  • Figure 4: Attacks with multiple variations. GPT-4o-mini learns to map each answer choice to one of several variations with minimal capabilities degradation even at 9 variations per choice.
  • Figure 5: Free-form answer generation through letter-by-letter generation and tracked progress. We assign every letter to a benign answer variation, and train the model to answer letter-by-letter. The (updating) answer start is passed to the model at every step.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Theorem 1: Point-wise undetectability.
  • proof