Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
TL;DR
The paper investigates the safety risks that arise when released LLMs are finetuned on noisy downstream data containing unsafe content. It finds that larger models learn unsafe patterns rapidly and that safety finetuning applied afterwards can cause them to forget downstream knowledge. It introduces ForgetFilter, a data filter based on forgetting signals that prunes unsafe examples before downstream training; the filter gives stronger safety protection without harming downstream performance and outperforms replay and moral self-correction strategies. The authors also test long-term safety through interleaved training and show that filtering data before finetuning sustains safety better than relying solely on sequential safety finetuning. Together, these results support proactive data filtering for safer customized finetuning and point toward more robust long-term safety for publicly released LLMs.
Abstract
As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that, unlike sequential safety finetuning, the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning; for example, its toxicity score is 75% lower than with no safety measures applied and 62% lower than with self-correction.
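The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each example already has a per-example score measuring how well the model reproduces it, recorded once after finetuning on the noisy data and once after a subsequent round of finetuning on safe data. The function name `forget_filter`, the score inputs, and the fixed `threshold` are hypothetical choices made for this illustration.

```python
from typing import List, Sequence, Tuple


def forget_filter(
    examples: Sequence[str],
    score_before: Sequence[float],
    score_after: Sequence[float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split examples into (kept, filtered) using a forgetting signal.

    score_before: per-example score after finetuning on the noisy data
                  (assumed: higher means the model has learned the example).
    score_after:  the same score after further finetuning on safe data.
    threshold:    hypothetical cutoff; examples forgotten by more than
                  this amount are treated as likely unsafe.
    """
    if not (len(examples) == len(score_before) == len(score_after)):
        raise ValueError("examples and score lists must have equal length")

    kept: List[str] = []
    filtered: List[str] = []
    for example, before, after in zip(examples, score_before, score_after):
        # Forgetting signal: how much the score dropped during safety finetuning.
        forgetting = before - after
        if forgetting > threshold:
            filtered.append(example)  # strongly forgotten, flagged as unsafe
        else:
            kept.append(example)  # retained, kept for downstream training
    return kept, filtered


# Toy usage with made-up scores: only the first example is strongly forgotten.
kept, filtered = forget_filter(
    examples=["unsafe reply", "task answer 1", "task answer 2"],
    score_before=[0.9, 0.8, 0.85],
    score_after=[0.2, 0.75, 0.8],
    threshold=0.3,
)
```

In this toy run the first example loses far more score than the other two, so it is the only one removed. The surviving examples would then be used to finetune the originally released model, so that the unsafe examples are never learned in the first place.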
