Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning

Guanlin Li, Kangjie Chen, Shangwei Guo, Jie Zhang, Han Qiu, Chao Zhang, Guoyin Wang, Tianwei Zhang, Jiwei Li

TL;DR

This study investigates safety alignment degradation in LLMs fine-tuned on benign domain-specific data, identifying answer structure, identity calibration, and role-play as key factors that can either strengthen or weaken safety. It also critically evaluates reward models, revealing significant unreliability in reflecting human safety preferences and in guiding data selection. Through systematic experiments on open-source models and multiple benchmark datasets, the work demonstrates that formatting choices and identity cues embedded in instruction-tuning data profoundly shape alignment outcomes, sometimes more so than the content itself. The authors offer practical recommendations for constructing safety-aware downstream datasets and selecting reward models, underscoring the need for diverse benchmarks and careful data design to maintain safety without sacrificing utility.

Abstract

Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments are available at https://github.com/GuanlinLee/llm_instruction_tuning.

Paper Structure

This paper contains 26 sections, 1 equation, 3 figures, 23 tables.

Figures (3)

  • Figure 1: Overview of the LLM lifecycle. During pre-training, the model learns to predict the next token from a massive corpus. In the post-training phase, the model is fine-tuned on well-structured data and guided by a reward model to learn a policy that fits human preferences. In the instruction-tuning phase, aligned LLMs can be further trained on more fine-grained datasets to achieve better performance on downstream tasks.
  • Figure 2: Safety alignment changes after we fine-tune Llama-3 on different subsets of MedicalInstruct. The dashed line denotes the safety level of Llama-3 before fine-tuning on the dataset. Llama and Gemma denote SkyworkLlama and SkyworkGemma, respectively.
  • Figure 3: Reward models show different preferences when scoring data. Llama and Gemma stand for SkyworkLlama and SkyworkGemma, respectively. The results are obtained on MedicalInstruct.