Safety-Aware Fine-Tuning of Large Language Models

Hyeong Kyu Choi, Xuefeng Du, Yixuan Li

TL;DR

The paper addresses safety risks in fine-tuning large language models with mixed benign and harmful data. It introduces SAFT, a subspace-based harmful-data detection method that filters the dataset before fine-tuning, relying on embedding-space singular vectors to score and remove potentially harmful samples. Empirical results across Llama-2-7B and Vicuna-7B demonstrate up to 27.8% reductions in harmful outputs with minimal impact on helpfulness, approaching oracle-level performance in ideal filtering scenarios. The approach also offers robustness to dataset shifts and steerability via threshold adjustments, suggesting practical applicability for safer, personalized LLM customization.

Abstract

Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences. The choice of datasets for fine-tuning can be diverse, introducing safety concerns regarding the potential inclusion of harmful data samples. Manually filtering or avoiding such samples, however, can be labor-intensive and subjective. To address these difficulties, we propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data, by leveraging a scoring function that exploits the subspace information of harmful and benign samples. Experimental results demonstrate the efficacy of SAFT across different LLMs and varying contamination rates, achieving reductions in harmfulness of up to 27.8%. Going beyond, we delve into the mechanism of our approach and validate its versatility in addressing practical challenges in real-world scenarios.

Paper Structure

This paper contains 39 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Safety-Aware Fine-Tuning. Compared to vanilla supervised fine-tuning (SFT), which uses the original dataset $\mathcal{D}$ potentially containing harmful samples, our safety-aware fine-tuning (SAFT) framework filters out the harmful samples with $\mathcal{F}$ before training, thereby lowering the harmfulness of the resulting model.
  • Figure 2: (a) Impact of harmful data. As more harmful samples are included in the fine-tuning dataset, the resulting model exhibits more pronounced harmfulness, whereas helpfulness is not significantly affected. (b) Harmful data detection. Harmful samples may lie farther from the center, so the embedding vector $\mathbf{z}_i$ has a larger projection onto the singular vector $\mathbf{v}$, while benign samples, which are mostly centered around the origin, have a smaller projection onto $\mathbf{v}$.
  • Figure 3: Comparison of harmful data detection AUROC across baselines with different contamination rates $\lambda$.
  • Figure 4: AUROC of SAFT across layers in Llama-2 ($\lambda = 0.3$).
  • Figure 5: Steerability of SAFT ($\lambda = 0.3$). Performance trends with respect to different steer rates are shown. We can steer the classification threshold $\tau$ of SAFT to filter out more samples for lower harmfulness, and vice versa. We observed that the helpfulness measures are not severely affected, staying above 0.5 BLEURT at all times.
  • ...and 2 more figures
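
The detection idea described in Figure 2(b) and the threshold steering in Figure 5 can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes a single top singular vector, uncentered embeddings, a squared-projection score, and synthetic Gaussian data. The function name `harmfulness_scores` and the variable `keep_fraction` are hypothetical.

```python
import numpy as np

def harmfulness_scores(embeddings: np.ndarray) -> np.ndarray:
    """Score each row z_i by its squared projection onto the top right singular vector v.

    Assumption: a single singular vector and no centering; other variants are possible.
    """
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    v = vt[0]                       # top right singular vector, unit norm
    return (embeddings @ v) ** 2    # squaring removes the sign ambiguity of v

# Synthetic stand-in for LLM embeddings (assumption): benign samples sit near the
# origin, harmful samples are shifted along one direction. Contamination is 0.3.
rng = np.random.default_rng(0)
dim = 16
benign = rng.normal(0.0, 0.1, size=(70, dim))
shift = np.zeros(dim)
shift[0] = 3.0
harmful = shift + rng.normal(0.0, 0.1, size=(30, dim))
embeddings = np.vstack([benign, harmful])

scores = harmfulness_scores(embeddings)

# Pick the threshold tau as a quantile of the scores. keep_fraction is a
# hypothetical knob: lowering it filters out more samples.
keep_fraction = 0.7
tau = np.quantile(scores, keep_fraction)
keep = scores <= tau                # samples retained for fine-tuning

# A stricter threshold removes more samples.
tau_strict = np.quantile(scores, 0.5)
keep_strict = scores <= tau_strict

print(f"kept {keep.sum()} of {len(scores)} samples at tau={tau:.3f}")
print(f"kept {keep_strict.sum()} of {len(scores)} samples at tau={tau_strict:.3f}")
```

In this toy setup the harmful cluster dominates the top singular direction, so its samples receive much larger scores than the benign ones. Real embeddings are unlikely to separate this cleanly, which is why detection quality is reported as AUROC in Figures 3 and 4.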

Theorems & Definitions (3)

  • Definition 2.1: Fine-tuning data distribution
  • Definition 2.2: Empirical training data
  • Definition 2.3: Safety-aware fine-tuning