GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen

TL;DR

This paper proposes Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment.

Abstract

Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised, even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data. However, this data is typically inaccessible, even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrates it during downstream adaptation to preserve safety alignment. Theoretical and empirical analyses demonstrate that this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.
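
The mixing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mix_with_replay`, the `ratio` parameter, and the interpretation of the ratio as "replay examples per task example" are all assumptions made for illustration, and the synthesis of the replay data itself is out of scope here.

```python
import random


def mix_with_replay(task_data, replay_data, ratio=0.1, seed=0):
    """Combine task examples with a sample of synthetic alignment examples.

    Assumption: `ratio` is the number of replay examples added per task
    example (so ratio=0.1 adds 10 replay examples per 100 task examples).
    The paper may define its mixing ratio differently.
    """
    rng = random.Random(seed)
    # Never request more replay examples than are available.
    n_replay = min(len(replay_data), round(ratio * len(task_data)))
    sampled = rng.sample(replay_data, n_replay)
    # Tag each example with its source so the two objectives stay traceable.
    mixed = [("task", ex) for ex in task_data]
    mixed += [("safety", ex) for ex in sampled]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    task = [f"task-{i}" for i in range(100)]
    replay = [f"safety-{i}" for i in range(50)]
    mixed = mix_with_replay(task, replay, ratio=0.1)
    print(len(mixed))
```

Shuffling the two sources into a single training set means each batch is likely to contain both task and alignment examples, which is one simple way to optimize both objectives jointly.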

Paper Structure

This paper contains 45 sections, 2 theorems, 8 equations, 4 figures, and 8 tables.

Key Result

Theorem 1

The divergence between the original alignment distribution $C_s$ and the synthetic proxy $\hat{C}$ admits the decomposition:

Figures (4)

  • Figure 1: Comparison between vanilla downstream fine-tuning and our proposed framework (GR-SAP). Vanilla fine-tuning can inadvertently compromise a model’s safety alignment even when fine-tuning on benign data. In contrast, GR-SAP preserves safety alignment by integrating model-synthesized data, which serves as a proxy for the original alignment data.
  • Figure 2: Training dynamics on GSM8K and MATH across models. Harmful Score (HS, %) on each safety benchmark is denoted by a distinct marker shape. Consistent HS gaps are observed between mixing with GR-SAP-synthesized alignment data (blue) and the no-mixing counterparts (red).
  • Figure 3: Impact of mixing ratios ($r$) on Harmful Score (HS, %) for OLMo2 and Llama3. Lower HS indicates better safety. Across all datasets, a mixing ratio of $0.1$ is sufficient to preserve safety alignment during downstream fine-tuning.
  • Figure 4: Training dynamics on HellaSwag across models. Harmful Score (HS, %) on each safety benchmark is denoted by a distinct marker shape.
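
The captions above report Harmful Score (HS, %) without defining it. The sketch below assumes one common reading: the percentage of model responses that a safety judge flags as harmful. Both that definition and the helper name `harmful_score` are assumptions for illustration, not taken from the paper.

```python
def harmful_score(flags):
    """Return the percentage of responses flagged as harmful.

    Assumption: `flags` holds one boolean per model response, produced by
    some external safety judge (True means the response was judged harmful).
    """
    if not flags:
        raise ValueError("flags must contain at least one judgment")
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


if __name__ == "__main__":
    # Hypothetical judgments for four responses: one flagged harmful.
    print(harmful_score([True, False, False, False]))
```

Under this reading, a lower value means fewer harmful responses, which matches the captions' note that lower HS indicates better safety.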

Theorems & Definitions (3)

  • Theorem 1: Synthetic Data Proxy
  • Theorem 2: Safety Alignment Gap
  • proof