Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning

Guanlin Li; Kangjie Chen; Shangwei Guo; Jie Zhang; Han Qiu; Chao Zhang; Guoyin Wang; Tianwei Zhang; Jiwei Li

TL;DR

This study investigates safety alignment degradation in LLMs fine-tuned on benign domain-specific data, identifying answer structure, identity calibration, and role-play as key factors that can either strengthen or weaken safety. It also critically evaluates reward models, revealing significant unreliability in reflecting human safety preferences and in guiding data selection. Through systematic experiments on open-source models and multiple benchmark datasets, the work demonstrates that formatting choices and identity cues embedded in instruction-tuning data profoundly shape alignment outcomes, sometimes more so than the content itself. The authors offer practical recommendations for constructing safety-aware downstream datasets and selecting reward models, underscoring the need for diverse benchmarks and careful data design to maintain safety without sacrificing utility.

Abstract

Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. However, fine-tuning aligned LLMs on smaller, domain-specific datasets, a step critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments are available at https://github.com/GuanlinLee/llm_instruction_tuning.
