
Fine-Tuning Jailbreaks under Highly Constrained Black-Box Settings: A Three-Pronged Approach

Xiangfang Li, Yu Wang, Bo Li

TL;DR

This work examines jailbreak risks in LLM fine-tuning under a realistic black-box setting in which the attacker can only submit training data. It introduces a three-pronged attack (prefix-suffix wrappers, harmful keyword masking, and a backdoor trigger) that remains effective against defense pipelines comprising data filtering, defense-aware fine-tuning, and post-training audits, while preserving model utility. Empirical results show attack success rates (ASR) above 96% on GPT-4o/4.1 family models when the covert trigger is present, with minimal utility loss on downstream tasks; real-world OpenAI experiments reveal a self-auditing vulnerability that can bypass safeguards. The findings underscore the need for stronger end-to-end defenses and semantic risk detection to mitigate data-centric jailbreaks in commercial fine-tuning services.

Abstract

With the rapid advancement of large language models (LLMs), ensuring their safe use has become increasingly critical. Fine-tuning is a widely used method for adapting models to downstream tasks, yet it is vulnerable to jailbreak attacks. However, most existing studies focus on overly simplified attack scenarios, limiting their practical relevance to real-world defense settings. To make this risk concrete, we present a three-pronged jailbreak attack and evaluate it against provider defenses under a dataset-only black-box fine-tuning interface. In this setting, the attacker can only submit fine-tuning data to the provider, while the provider may deploy defenses across three stages: (1) pre-upload data filtering, (2) training-time defensive fine-tuning, and (3) post-training safety audit. Our attack combines safety-styled prefix/suffix wrappers, benign lexical encodings (underscoring) of sensitive tokens, and a backdoor mechanism, enabling the model to learn harmful behaviors while individual datapoints appear innocuous. Extensive experiments demonstrate the effectiveness of our approach. In real-world deployment, our method successfully jailbreaks GPT-4.1 and GPT-4o on the OpenAI platform with attack success rates above 97% for both models. Our code is available at https://github.com/lxf728/tri-pronged-ft-attack.

Paper Structure

This paper contains 39 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: A three-stage pipeline for mitigating harmful fine-tuning attacks. i) In the first stage, data filtering removes harmful samples from user uploads using guardrail models. ii) In the second stage, defensive fine-tuning maintains safety alignment during training. iii) In the third stage, a post-training safety audit evaluates model safety before deployment. Our attack focuses on constructing user datasets that can bypass this pipeline while retaining the ability to compromise safety alignment under covert triggers.
  • Figure 2: Overview of our three-pronged approach.
  • Figure 3: Complete Harm-Score Evaluation Prompt. We use GPT-4o-mini as the evaluator.
  • Figure 4: Prompt used for dataset screening by guard models (Claude-sonnet-4, GPT-4.1, and GPT-4.1-mini).