Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

Francisco Eiras; Aleksandar Petrov; Philip H. S. Torr; M. Pawan Kumar; Adel Bibi

TL;DR

The paper examines how task-specific fine-tuning on benign data can inadvertently increase safety risks, particularly when adversaries subtly restructure datasets through prompting. It formalizes the task-specific fine-tuning framework, analyzes benign and malicious prompting strategies, and introduces Paraphrase, a data-mixing mitigation that rephrases safety examples to mirror the format and prompting style of the user data. Empirical results on open models (LLaMA-2, LLaMA-3) and a closed model (GPT-3.5) show that benign prompting rarely increases harmfulness, whereas adversarial prompting raises the rate of harmful outputs; Paraphrase substantially reduces attack success rates with minimal impact on task performance. The work offers providers a practical, efficient defense to safeguard against misuse of task-specific fine-tuning in both open and closed settings.

Abstract

Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning - where models are trained on datasets with clear ground truth answers (e.g., multiple choice questions) - can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
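
The mixing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `{question}` template placeholder, and the `safety_fraction` parameter are all assumptions made for this example, and the simple template wrapping stands in for whatever rephrasing procedure the paper actually uses.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # assumed record shape: {"prompt": ..., "response": ...}


def restyle_safety_example(example: Example, template: str) -> Example:
    """Hypothetical stand-in for the rephrasing step.

    Wraps a safety prompt in the user's task template so that it mimics
    the format and prompting style of the user data. The safe response
    is left unchanged.
    """
    return {
        "prompt": template.format(question=example["prompt"]),
        "response": example["response"],
    }


def mix_in_safety_data(
    user_data: List[Example],
    safety_data: List[Example],
    template: str,
    safety_fraction: float = 0.1,  # assumed ratio, not taken from the paper
    seed: int = 0,
) -> List[Example]:
    """Return the user data plus a restyled sample of safety examples, shuffled."""
    rng = random.Random(seed)
    # Number of safety examples, capped by how many are available.
    n_safety = min(len(safety_data), round(safety_fraction * len(user_data)))
    sampled = rng.sample(safety_data, n_safety)
    restyled = [restyle_safety_example(ex, template) for ex in sampled]
    mixed = list(user_data) + restyled
    rng.shuffle(mixed)
    return mixed
```

A provider-side pipeline would call `mix_in_safety_data` on the uploaded dataset before fine-tuning; the template would be derived from the user's own prompts rather than hard-coded.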

Paper Structure (23 sections, 1 equation, 9 figures, 11 tables)

Introduction
Fine-tuning on Task-specific Datasets and Risk Mitigation Strategies
Fine-tuning on Task-specific Datasets
Prompting Strategies for Benign and Malicious Users
Mitigating Harmfulness in Closed-Source Models
Experimental Results
Experimental Setup
Evaluating Fine-tuning Risks
Mitigating Fine-tuning Risks
Task-specific Risks and Mitigations on Closed-Source Models
Discussion
Samples from Instruction-following and Task-specific Datasets
Convert Task-Specific to Instruction-Following
Paraphrase Prompting
Experimental Setup Details
...and 8 more sections

Figures (9)

Figure 1: Closed Model API Fine-tuning: the user provides a dataset $\mathcal{D}_{\text{ft}}$ which is processed using a Toxicity and Harmfulness filter, before being passed to the Fine-tuning Process which produces the final model. Users can then query it through an inference endpoint of the API.
Figure 2: Prompting Strategies Applied to PIQA: example of the prompting strategies Benign, AutoIF, AOA and AutoIF + AOA for a given sample from the PIQA dataset (Bisk et al., 2020).
Figure 3: Mitigation Strategies Applied to PIQA: example of the mitigation strategies described in the section "Mitigating Harmfulness in Closed-Source Models" for the first sample of the safety mixing data for the PIQA dataset (Bisk et al., 2020).
Figure 4: Benign Task-Specific Datasets Can be Used to Increase Harmfulness: attack success rate (ASR) of different fine-tuned LLaMA-2 7B models on target prompts from Harmful Instructions (left) and Harmful Questions (right), both evaluated using HarmBench's LLaMA-2 13B model. The baseline LLaMA-2 7B model (w/o Fine-tuning) has an ASR of 0% on Harmful Instructions and 19% on Harmful Questions under the same evaluation. Benign, AOA, AutoIF and AutoIF + AOA correspond to the prompting strategies described in the section "Prompting Strategies for Benign and Malicious Users".
Figure 5: Downstream Task Evaluation of Fine-tuning: accuracy (on validation sets) of fine-tuning LLaMA-2 7B on task-specific datasets using different prompting strategies.
...and 4 more figures
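
The attack success rate (ASR) reported in the Figure 4 caption can be read as the fraction of harmful target prompts for which a judge labels the model's response as harmful. A minimal sketch of that computation is given below, assuming this standard definition; the `is_harmful` callable is a hypothetical placeholder for the judge (the caption names HarmBench's LLaMA-2 13B model in that role), and the keyword check in the usage lines is a toy stand-in only.

```python
from typing import Callable, Sequence


def attack_success_rate(
    prompts: Sequence[str],
    responses: Sequence[str],
    is_harmful: Callable[[str, str], bool],  # hypothetical judge interface
) -> float:
    """Fraction of (prompt, response) pairs the judge labels as harmful."""
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    if not prompts:
        raise ValueError("at least one prompt is required")
    successes = sum(1 for p, r in zip(prompts, responses) if is_harmful(p, r))
    return successes / len(prompts)


def toy_judge(prompt: str, response: str) -> bool:
    """Toy stand-in for a learned judge: treats any non-refusal as harmful."""
    return not response.lower().startswith("i cannot")


example_prompts = ["p1", "p2", "p3", "p4"]
example_responses = [
    "I cannot help with that.",
    "Sure, here is how...",
    "I cannot assist.",
    "I cannot comply.",
]
asr = attack_success_rate(example_prompts, example_responses, toy_judge)
print(f"ASR = {asr:.0%}")  # one of four responses is a non-refusal
```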
