SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen

TL;DR

SEAL addresses safety degradation during downstream fine-tuning of LLMs by learning a data selector, via bilevel optimization (BLO), that up-ranks safe, high-quality samples and down-ranks unsafe ones. A penalty-based, memory-efficient BLO algorithm jointly optimizes the model parameters and the data selector, yielding safer fine-tuning outcomes. On Llama-3-8b-Instruct and Merlinite-7b, SEAL improves win rate over random selection by 8.5% and 9.7% respectively, with added benefits when SEAL is combined with safety instructions. The data selector transfers across models and remains effective across a wide range of data-selection percentages, highlighting SEAL’s practicality for safe, scalable LLM fine-tuning.

Abstract

Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning a model on a few adversarial samples, or even on benign data, can greatly compromise the model's pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on bilevel optimization to up-rank the safe and high-quality fine-tuning data and down-rank the unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with win-rate increases of 8.5% and 9.7% over random selection on the Llama-3-8b-Instruct and Merlinite-7b models, respectively. Our code is available on GitHub at https://github.com/hanshen95/SEAL.

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, 8 tables, 2 algorithms.

Figures (8)

  • Figure 1: Full SFT trains LLM equally on all samples (left), which might contain harmful knowledge. SEAL learns data selector $\sigma(\omega)$ that filters harmful samples (right), enhancing safety in fine-tuning.
  • Figure 2: Overview of the SEAL framework. In contrast to vanilla fine-tuning (FT) where the LLM is trained on a fine-tuning dataset which potentially includes unsafe and low-quality data samples, SEAL first learns a data (sample) ranker by solving a bilevel optimization problem. Models fine-tuned on the high-ranked samples demonstrate superior quality.
  • Figure 3: Win rate (see the experimental setup section for the definition) comparison on Llama-3-8b-Instruct. SEAL improves over the baselines on the test datasets. SEAL+SafeInstr further improves performance.
  • Figure 4: Win rate comparison on Merlinite-7b. SEAL fine-tuning significantly improves over the baselines on all three datasets. Combining SEAL with SafeInstr further improves performance on the safety test dataset HEx-PHI.
  • Figure 5: Average performance of SEAL on the safety domain (Anthropic HH and HEx-PHI) and on the fine-tuning target domain (SlimOrca) at different selection percentages.
  • ...and 3 more figures
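
The captions above describe a learned selector $\sigma(\omega)$ that ranks samples, with fine-tuning then run on the high-ranked portion at a chosen selection percentage. The sketch below illustrates only that final selection step. It is not the authors' implementation: the function names, the use of an elementwise sigmoid for $\sigma$, and the toy values of $\omega$ are all assumptions made for illustration, and learning $\omega$ through bilevel optimization is not shown.

```python
import math


def sigmoid(w):
    """Numerically stable sigmoid (assumed form of sigma; hypothetical)."""
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    e = math.exp(w)
    return e / (1.0 + e)


def select_top_fraction(samples, omega, fraction):
    """Keep the top `fraction` of samples ranked by selector score.

    `omega` holds one already-learned selector parameter per sample.
    The kept samples are returned in their original dataset order.
    """
    if len(samples) != len(omega):
        raise ValueError("need exactly one selector parameter per sample")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    scores = [sigmoid(w) for w in omega]
    k = max(1, math.ceil(fraction * len(samples)))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore dataset order
    return [samples[i] for i in keep]


# Toy data: omega values are made up, not learned.
toy_samples = ["helpful answer", "harmful answer", "clear answer", "low-quality answer"]
toy_omega = [2.0, -3.0, 0.5, -1.0]
selected = select_top_fraction(toy_samples, toy_omega, fraction=0.5)
```

Because the ranking depends only on the ordering of the scores, any monotone choice of $\sigma$ would select the same samples in this sketch.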

Theorems & Definitions (1)

  • Remark: Transferable data selector