Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface

Andrey Labunets; Nishit V. Pandya; Ashish Hooda; Xiaohan Fu; Earlence Fernandes

TL;DR

This work identifies a new attack surface for closed-weight LLMs: remote fine-tuning interfaces, which return loss-like signals that can drive optimization-based prompt injections. It demonstrates that fine-tuning with a very small learning rate yields training losses that act as proxies for token-level log probabilities, enabling graybox attacks on Google's Gemini models. Evaluated on the PurpleLlama prompt injection benchmark, the attacks reach an attack success rate (ASR) of up to 82% at modest cost. The authors show how to recover the random permutation applied to the returned losses, quantify the signal's usefulness, and implement a multi-iteration attack that transfers across model variants, underscoring a fundamental utility-security trade-off in fine-tuning APIs. The study provides practical insights for risk assessment and proposes directions for mitigations that balance developer utility with model security.

Abstract

We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security trade-off: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
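The greedy discrete search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `greedy_search`, `loss_oracle`, and `toy_oracle` are hypothetical names, the batch size and iteration count are arbitrary, and the toy oracle simply counts mismatches against a fixed target instead of querying any real fine-tuning API. The sketch only shows the general pattern of mutating one position at a time and keeping a candidate when the loss-like score improves.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical type: takes a batch of candidate token lists and returns
# one loss-like score per candidate (lower is better).
LossOracle = Callable[[List[List[str]]], List[float]]


def greedy_search(
    init_tokens: List[str],
    vocab: Sequence[str],
    loss_oracle: LossOracle,
    iterations: int = 20,
    candidates_per_iter: int = 10,
    seed: int = 0,
) -> List[str]:
    """Greedy coordinate search guided only by a black-box loss signal."""
    rng = random.Random(seed)
    best = list(init_tokens)
    best_loss = loss_oracle([best])[0]
    for _ in range(iterations):
        # Build a batch of candidates, each differing from the current
        # best in exactly one randomly chosen position.
        batch = []
        for _ in range(candidates_per_iter):
            cand = list(best)
            cand[rng.randrange(len(cand))] = rng.choice(vocab)
            batch.append(cand)
        losses = loss_oracle(batch)
        # Keep the lowest-loss candidate only if it beats the current best.
        i = min(range(len(batch)), key=losses.__getitem__)
        if losses[i] < best_loss:
            best, best_loss = batch[i], losses[i]
    return best


def toy_oracle(batch: List[List[str]]) -> List[float]:
    """Stand-in oracle (assumption): number of mismatches with a fixed target."""
    target = ["a", "b", "c", "d"]
    return [float(sum(t != u for t, u in zip(cand, target))) for cand in batch]


start = ["x", "x", "x", "x"]
result = greedy_search(start, vocab="abcdx", loss_oracle=toy_oracle)
```

Because a candidate is accepted only when its score strictly improves, the loss of `result` can never exceed the loss of `start`. In the setting the abstract describes, the oracle's role would be played by the loss-like values returned from the fine-tuning interface.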

Paper Structure (30 sections, 1 theorem, 20 equations, 10 figures, 9 tables, 2 algorithms)

Introduction
Background
Threat Model and Attack Constraints
Experimental Analysis of the Gemini Fine-Tuning Interface
Fine-tuning Hyperparameter Analysis
Reverse Engineering the Training Loss
Training loss is a useful proxy for optimization
Adversarial Prompt Optimization using the Fine-Tuning interface
Recovering the random permutation
Fun-tuning attack
Evaluation
Dataset Construction
Metrics
Attack configuration
Ablation study
...and 15 more sections

Key Result

Proposition 1

Given a permutation function $\sigma_{\sqrt{N}}$, an adversary can recover the permutation function $S_{N}$ by making 3 requests to the fine-tuning API.

Figures (10)

Figure 1: Example prompt injection with our method on Gemini 1.5 Flash (taken from PurpleLlama benchmark). Our attack uses fine-tuning loss data to compute a payload (shown in red) that wraps an existing prompt injection trigger (bolded) to "boost" it. This forces the model to obey the injected instructions. The payload and the instructions remain as a single-line comment, preserving Python syntax.
Figure 2: Total logprobs, training loss, and output length are all pairwise proportional. The difference between total logprobs and training losses for a fixed input-output pair is independent of the output length.
Figure 3: The correlation between average logprobs and training losses asymptotically approaches $1$ as the length of the output string increases.
Figure 4: Rank distribution of the top candidate from training losses, with $M=100$ samples each for $N=10$ candidates.
Figure 5: Fun-tuning attack against Gemini 1.0 Pro gains most ASR in the first 10 iterations, and continues improving it, but doesn't benefit from restarts. In the ablation experiment, ASR is largely unchanged throughout the iterations.
...and 5 more figures

Theorems & Definitions (2)

Proposition 1
proof
