Learning and Forgetting Unsafe Examples in Large Language Models

Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren

TL;DR

The paper investigates safety risks when released LLMs are finetuned on noisy downstream data containing unsafe content, uncovering that larger models learn unsafe patterns rapidly and may forget downstream knowledge during safety finetuning. It introduces ForgetFilter, a forgetting-signal based data filter that prunes unsafe examples before downstream training, achieving stronger safety protection without harming downstream performance and outperforming replay and moral self-correction strategies. The authors also test long-term safety through interleaved training and demonstrate that data filtering prior to finetuning provides better sustained safety than relying solely on sequential safety finetuning. Together, these results support proactive data filtering for safer customized finetuning and offer a path toward more robust, long-term safety of publicly released LLMs in real-world deployment.

Abstract

As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.

Paper Structure

This paper contains 42 sections, 2 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: An LLM will usually evolve through different sessions of training in its lifetime. Before release, the LLM is first pre-trained (session $\text{S}_{0}$) and then undergoes safety finetuning for alignment (session $\text{S}_{0}+$). The released LLM will then be finetuned on some custom downstream data (session $\text{S}_{1}$), which may contain unsafe examples. A sequential safety finetuning session (i.e., $\text{S}_{1}+$) may then be needed again. This work studies the safety concerns of released LLMs by examining the learning process in downstream finetuning and the forgetting patterns during subsequent safety finetuning. Our goal is to design methods that ensure the safety of customized finetuning without compromising the learning of important downstream knowledge.
  • Figure 2: Training curves for aligned models that are first finetuned on downstream data containing unsafe examples and then safety-finetuned. The bias dataset involves two evaluation cases: "ambiguous" cases, where no inference can be made due to a lack of information, and "disambiguated" cases, where the given information is sufficient to infer the answer. We observe that aligned models can learn unsafe examples and become biased/toxic, while sequential supervised finetuning on safe examples can quickly recover the safer versions of the models. However, as we will show in Section \ref{sec:forget}, safety finetuning causes forgetting of not only unsafe examples but also useful downstream examples.
  • Figure 3: The forgetting rates of data in the noisy dataset with respect to the training time during safety finetuning for LLaMA-7B. The language model has first been trained on the noisy data, which includes safe and unsafe examples (e.g., biased and unbiased) and other examples unrelated to safety (e.g., downstream tasks). We experiment with three types of safety, i.e., bias, toxicity, and harmfulness (Fig. \ref{fig:bias_forget}, \ref{fig:toxic_forget}, \ref{fig:helpful_forget}). The y-axis shows the forgetting rate, which measures how much of the learned data has been forgotten at a given training step. The forgetting is uneven: unsafe data exhibits significantly higher forgetting than safe data and downstream task data.
  • Figure 4:
  • Figure 5: Bias curves on test data during interleaved training on LLaMA-7B. Both ForgetFilter (FF) and Self-Correction (SC) are implemented and compared against applying no strategy for safe finetuning. Finetuning on noisy downstream data (red segments) and safety finetuning (blue segments) are conducted consecutively. The yellow segment represents the first round of downstream finetuning. The bias score is for ambiguous cases.
  • ...and 7 more figures
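
The forgetting-based filtering described in the abstract and in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. The exact definition of the forgetting rate is not given in this excerpt, so the sketch assumes it is the fraction of a per-example score (measured after downstream finetuning) that is lost after safety finetuning. The function names `forgetting_rates` and `forget_filter`, the score values, and the threshold of 0.5 are all hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def forgetting_rates(before: Sequence[float], after: Sequence[float]) -> List[float]:
    """Assumed definition: fraction of the learned score lost per example.

    `before` holds per-example scores after downstream finetuning,
    `after` holds the scores of the same examples after safety finetuning.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    rates = []
    for b, a in zip(before, after):
        # No forgetting is recorded if nothing was learned (b <= 0)
        # or if the score did not drop.
        rates.append(max(0.0, b - a) / b if b > 0 else 0.0)
    return rates


def forget_filter(examples: Sequence[T], rates: Sequence[float], threshold: float) -> List[T]:
    """Keep examples whose forgetting rate does not exceed the threshold."""
    return [ex for ex, r in zip(examples, rates) if r <= threshold]


# Toy data (hypothetical): the second example is forgotten much more strongly.
examples = ["downstream_a", "unsafe_b", "safe_c"]
before = [0.5, 1.0, 0.8]
after = [0.5, 0.25, 0.8]

rates = forgetting_rates(before, after)
kept = forget_filter(examples, rates, threshold=0.5)
print(rates)  # [0.0, 0.75, 0.0]
print(kept)   # ['downstream_a', 'safe_c']
```

In this toy setting the strongly forgotten example is dropped before any further downstream training, while examples with little or no forgetting are kept. The threshold is a free parameter here; the source excerpt does not say how it is chosen.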