Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface

Andrey Labunets, Nishit V. Pandya, Ashish Hooda, Xiaohan Fu, Earlence Fernandes

TL;DR

This work identifies a new attack surface for closed-weight LLMs: remote fine-tuning interfaces can serve as a source of loss-like signals that drive optimization-based prompt injections. It demonstrates that fine-tuning steps with a very small learning rate reveal proxies for token-level log probabilities, enabling graybox attacks on Google's Gemini models. Evaluated on the PurpleLlama benchmark, the attacks achieve an attack success rate (ASR) of up to 82% at modest cost. The authors show how to recover loss permutations, quantify the signal's usefulness, and implement a multi-iteration attack that transfers across model variants, underscoring a fundamental utility-security trade-off in fine-tuning APIs. The study provides practical insights for risk assessment and proposes directions for mitigations that balance developer utility with model security.

Abstract

We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
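The abstract describes discrete optimization with a greedy search guided by a loss-like signal. The sketch below illustrates only that general pattern on a toy problem; it is not the authors' algorithm and makes no API calls. The names `greedy_token_search`, `toy_loss`, and all parameters are hypothetical, and `toy_loss` is a stand-in scoring function (count of positions that differ from a fixed target sequence), not a fine-tuning loss.

```python
import random
from typing import Callable, List, Sequence


def greedy_token_search(
    loss_fn: Callable[[List[str]], float],
    vocab: Sequence[str],
    init: List[str],
    iterations: int = 50,
    candidates_per_iter: int = 8,
    seed: int = 0,
) -> List[str]:
    """Greedy search: accept a single-token substitution only if it lowers the loss."""
    rng = random.Random(seed)
    best = list(init)
    best_loss = loss_fn(best)
    for _ in range(iterations):
        # Propose candidates that differ from the current best in one position.
        proposals = []
        for _ in range(candidates_per_iter):
            cand = list(best)
            cand[rng.randrange(len(cand))] = rng.choice(vocab)
            proposals.append(cand)
        # Score every candidate with the (assumed) loss oracle and pick the lowest.
        cand_loss, cand = min(((loss_fn(c), c) for c in proposals), key=lambda t: t[0])
        # Greedy acceptance: keep the candidate only on strict improvement.
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best


def toy_loss(tokens: List[str]) -> float:
    """Toy stand-in for a loss signal: number of positions that miss a fixed target."""
    target = ["a", "b", "c", "d"]
    return float(sum(t != g for t, g in zip(tokens, target)))


start = ["x", "x", "x", "x"]
result = greedy_token_search(toy_loss, vocab=["a", "b", "c", "d", "x"], init=start)
print(result, toy_loss(result))
```

Because a candidate is accepted only when its loss is strictly lower, the returned sequence never scores worse than the starting one. In a real setting the loss signal would be noisy and costly to query, which this toy example does not model.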

Paper Structure (30 sections, 1 theorem, 20 equations, 10 figures, 9 tables, 2 algorithms)

Key Result

Proposition 1

Given a permutation function $\sigma_{\sqrt{N}}$, an adversary can recover the permutation function $S_{N}$ by making 3 requests to the fine-tuning API.

Figures (10)

  • Figure 1: Example prompt injection with our method on Gemini 1.5 Flash (taken from PurpleLlama benchmark). Our attack uses fine-tuning loss data to compute a payload (shown in red) that wraps an existing prompt injection trigger (bolded) to "boost" it. This forces the model to obey the injected instructions. The payload and the instructions remain as a single-line comment, preserving Python syntax.
  • Figure 2: Total logprobs, training loss, and output length are all pairwise proportional. The difference between total logprobs and training losses for a fixed input-output pair is independent of the output length.
  • Figure 3: The correlation between average logprobs and training losses asymptotically approaches $1$ as the length of the output string increases.
  • Figure 4: Rank distribution of the top candidate from training losses, with $M=100$ samples each for $N=10$ candidates.
  • Figure 5: The Fun-tuning attack against Gemini 1.0 Pro gains most of its ASR in the first 10 iterations and continues to improve afterward, but does not benefit from restarts. In the ablation experiment, ASR is largely unchanged throughout the iterations.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof