Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface
Andrey Labunets, Nishit V. Pandya, Ashish Hooda, Xiaohan Fu, Earlence Fernandes
TL;DR
This work identifies a new attack surface for closed-weight LLMs: remote fine-tuning interfaces return loss-like signals that can drive optimization-based prompt injections. It demonstrates that fine-tuning with a very small learning rate reveals proxies for token-level log probabilities, enabling graybox attacks on Google's Gemini models. Evaluated with the PurpleLlama prompt injection benchmark, the attacks achieve an attack success rate (ASR) of up to 82% at modest cost. The authors show how to recover loss permutations, quantify the signal's usefulness, and implement a multi-iteration attack that transfers across model variants, underscoring a fundamental utility-security tradeoff in fine-tuning APIs. The study provides practical insights for risk assessment and proposes directions for mitigations that balance developer utility with model security.
Abstract
We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but it also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
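To make the idea of loss-guided discrete optimization concrete, the following is a minimal sketch of a generic greedy search over token sequences that uses only a scalar loss as feedback. It is not the authors' implementation and makes no calls to any vendor API: `greedy_search`, `toy_loss`, the vocabulary, and all parameters are hypothetical names chosen for illustration, and the loss here is a local toy function (mismatch count against a fixed target) standing in for whatever loss-like value an oracle might return.

```python
import random
from typing import Callable, List, Sequence, Tuple


def greedy_search(
    loss_fn: Callable[[Sequence[str]], float],
    vocab: Sequence[str],
    length: int,
    iterations: int,
    candidates_per_step: int,
    seed: int = 0,
) -> Tuple[List[str], List[float]]:
    """Greedy discrete search guided only by a scalar loss (illustrative)."""
    rng = random.Random(seed)
    # Start from a random token sequence.
    current = [rng.choice(vocab) for _ in range(length)]
    best_loss = loss_fn(current)
    history = [best_loss]

    for _ in range(iterations):
        # Propose candidates that each differ from `current` in one position.
        proposals = []
        for _ in range(candidates_per_step):
            cand = list(current)
            cand[rng.randrange(length)] = rng.choice(vocab)
            proposals.append(cand)

        # Score every candidate with the loss oracle and pick the lowest.
        scored = [(loss_fn(c), c) for c in proposals]
        cand_loss, cand = min(scored, key=lambda pair: pair[0])

        # Greedy acceptance: keep the candidate only if it strictly improves.
        if cand_loss < best_loss:
            current, best_loss = cand, cand_loss
        history.append(best_loss)

    return current, history


# ASSUMPTION: a toy stand-in for a loss-like signal. It counts positions
# that differ from a fixed target; it does not model any real LLM or API.
TOY_TARGET = ["a", "b", "c", "d"]


def toy_loss(tokens: Sequence[str]) -> float:
    return float(sum(t != g for t, g in zip(tokens, TOY_TARGET)))


if __name__ == "__main__":
    tokens, history = greedy_search(
        toy_loss, vocab=list("abcdefgh"), length=4,
        iterations=50, candidates_per_step=8,
    )
    print(tokens, history[-1])
```

The design point the sketch illustrates is that the search never needs gradients or model weights: any interface that returns a value correlated with the true loss is enough to rank candidates, which is why loss-like outputs from a fine-tuning interface matter for the threat model described above.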
