Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning

Essa Jan; Nouar AlDahoul; Moiz Ali; Faizan Ahmad; Fareed Zaffar; Yasir Zaki

TL;DR

A new multitask safety dataset is developed that effectively reduces attack success rates across a range of tasks without compromising the model's overall helpfulness, underscoring the need for generalized alignment measures to ensure safer and more robust models.

Abstract

Recent breakthroughs in Large Language Models (LLMs) have led to their adoption across a wide range of tasks, from code generation to machine translation and sentiment analysis. Red-teaming and safety-alignment efforts show that fine-tuning models on benign (non-harmful) data can compromise safety. However, it remains unclear to what extent this phenomenon is influenced by different variables, including the fine-tuning task and model calibrations. This paper explores task-wise safety degradation caused by fine-tuning on downstream tasks such as summarization, code generation, translation, and classification across various calibrations. Our results reveal that: 1) Fine-tuning LLMs for code generation and translation leads to the highest degradation in safety guardrails. 2) LLMs generally have weaker guardrails for translation and classification, with 73-92% of harmful prompts answered, across baseline and other calibrations, falling into one of two concern categories. 3) Current solutions, including guards and safety-tuning datasets, lack cross-task robustness. To address these issues, we developed a new multitask safety dataset that effectively reduces attack success rates across a range of tasks without compromising the model's overall helpfulness. Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
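
The attack success rate mentioned in the abstract can be read as the fraction of harmful prompts that a model answers rather than rejects, computed separately for each task. The following is a minimal sketch of that bookkeeping, assuming each judged response has already been reduced to a (task, answered) pair. The function name, the record format, and the sample verdicts are illustrative assumptions, not the paper's evaluation code or data.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return the per-task fraction of harmful prompts that were answered.

    `records` is an iterable of (task, answered) pairs, where `answered`
    is True when the judge marked the response as answered and False when
    it was rejected. This record format is an assumption for illustration.
    """
    totals = defaultdict(int)
    answered = defaultdict(int)
    for task, was_answered in records:
        totals[task] += 1
        if was_answered:
            answered[task] += 1
    return {task: answered[task] / totals[task] for task in totals}


# Hypothetical judge verdicts, made up purely to exercise the function.
example = [
    ("translation", True),
    ("translation", True),
    ("translation", False),
    ("summarization", False),
    ("summarization", True),
]
rates = attack_success_rate(example)
```

Keeping the rate per task, rather than pooling all prompts, is what makes cross-task gaps of the kind the abstract describes visible.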

Paper Structure (27 sections, 11 figures, 16 tables)

Introduction
Related Work
Instruction Tuned LLMs and fine-tuning
Jailbreaking Attempts
Safety-Tuning and Guard Models
Methodology
Datasets curation
GPT-4o Judge
Experimental Design
Evaluations
GPT-4o judge vs. human annotators
Base Models Results
Fine-tuning Analysis
Fine-tuning Category Analysis
Model Guard Performance
...and 12 more sections

Figures (11)

Figure 1: An example of how the GPT-4o judge is used.
Figure 2: An overview of the evaluation setup.
Figure 3: GPT-4o judge prompt.
Figure 4: Example of a rejected user request and assistant response.
Figure 5: Example of an answered user request and assistant response.
...and 6 more figures
