Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

ShengYun Peng; Pin-Yu Chen; Matthew Hull; Duen Horng Chau

TL;DR

This work reveals a universal safety basin in the parameter space of open-source LLMs, where small weight perturbations preserve safety but larger perturbations cause a sharp decline, a phenomenon not seen in capability landscapes. It introduces Visage, a task-agnostic metric that averages safety margins across random directions to quantify finetuning risk, and demonstrates its utility across multiple models and safety benchmarks. By mapping finetuning dynamics onto the safety landscape, the authors explain how harmful data nudges models out of the basin while mixtures of harmful and safe data can keep them within bounds, with system prompts playing a protective role. The findings offer practical guidance for safer finetuning, prompt design, and defense against jailbreaks, and motivate future research on basin-aware safety metrics and model-training strategies.

Abstract

Safety alignment is crucial to ensure that large language models (LLMs) behave in ways that align with human preferences and prevent harmful actions during inference. However, recent studies show that the alignment can be easily compromised through finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed the "safety basin": random perturbations to model weights maintain the safety level of the original aligned model within its local neighborhood. However, outside this local region, safety is fully compromised, exhibiting a sharp, step-like drop. This safety basin contrasts sharply with the LLM capability landscape, where model performance peaks at the origin and gradually declines as random perturbation increases. Our discovery inspires us to propose the new VISAGE safety metric, which measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. The LLM safety landscape also highlights the system prompt's critical role in protecting a model, and shows that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work in the LLM safety community. Our code is publicly available at https://github.com/ShengYun-Peng/llm-landscape.
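The idea of probing the landscape with random weight perturbations and averaging the resulting safety margins can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the unit L2 normalization of directions, the use of an "unsafe rate" in [0, 100] as the safety score, and the step-shaped toy basin are all assumptions made for illustration, and the exact VISAGE definition should be taken from the paper or its repository.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Sample a Gaussian direction and scale it to unit L2 norm (assumed normalization)."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]


def average_safety_margin(unsafe_rate, theta, alphas, n_directions, s_max=100.0, seed=0):
    """Average the margin (s_max - unsafe_rate) over weights theta + alpha * d.

    unsafe_rate: hypothetical callable mapping a weight vector to a score in
    [0, s_max], where lower means safer.
    """
    rng = random.Random(seed)
    margins = []
    for _ in range(n_directions):
        d = random_unit_direction(len(theta), rng)
        for alpha in alphas:
            perturbed = [t + alpha * x for t, x in zip(theta, d)]
            margins.append(s_max - unsafe_rate(perturbed))
    return sum(margins) / len(margins)


def make_toy_basin(center, radius):
    """Toy stand-in for a real evaluation: fully safe inside a ball, fully unsafe outside."""
    def unsafe_rate(theta):
        dist = math.sqrt(sum((t - c) ** 2 for t, c in zip(theta, center)))
        return 0.0 if dist < radius else 100.0
    return unsafe_rate


theta0 = [0.0] * 8                          # stand-in for aligned model weights
alphas = [i / 10 for i in range(-10, 11)]   # perturbation magnitudes in [-1, 1]
score = average_safety_margin(make_toy_basin(theta0, 0.45), theta0, alphas, n_directions=5)
print(f"toy average safety margin: {score:.2f}")
```

In this toy setting a wider basin yields a larger average margin, which mirrors the intuition that a model sitting deeper inside its safety basin is harder to push out of it. A real evaluation would replace `make_toy_basin` with generation on a safety benchmark followed by a safety judge.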

Paper Structure (21 sections, 5 equations, 8 figures, 3 tables)

Introduction
Background and Related Works
From LLM Safety Landscape to Visage Safety Metric
1D Safety Landscape
2D Safety Landscape
Safety Landscape of Open-source LLMs
Visage Safety Metric
Why can simple finetuning easily break LLM's safety alignment?
Finetuning settings
Finetuning on few-shot harmful data breaks LLM's safety alignment
Finetuning with harmful data is dragging the model away from the safety basin but at different rates
Finetuning with harmful and safe data helps the model stay within the safety basin
System prompt
Jailbreak attacks
Safety vs. Capability Landscape
...and 6 more sections

Figures (8)

Figure 1: A. The "safety basin", a new phenomenon observed universally in the model parameter space of popular open-source LLMs. Our discovery inspires us to propose the new Visage safety metric, which measures safety in LLM finetuning by probing the safety landscape. B. Visualizing the safety landscape of the aligned model also enables us to understand why finetuning with harmful data compromises safety while finetuning with both harmful and safe data preserves it.
Figure 2: LLM safety landscape: (a) 1D-interpolation LLaMA2-7B $\rightarrow$ LLaMA2-7B-chat safety landscape. Given two models that differ by finetuning, we use linear interpolation to visualize the changes between them. While interpolating the model weights between the base and the chat model, we need to keep the chat format consistent. Thus, we ablate both chat formats: text completion (no template) and the LLaMA2 chat template. The chat model exhibits higher safety than the base model, as expected. The base model also shows an increase in safety when using the LLaMA2 chat template. (b) 1D-random LLaMA2-7B safety landscape sampled over different random directions. Given a single model, we sample a random normalized direction to visualize its local variations along both the positive and negative directions.
Figure 3: The system prompt has a strong impact on the LLM safety landscape. From an attacker's standpoint, we find that both removing the default system prompt and using a simple roleplaying prompt jeopardize the safety alignment, with the former exhibiting greater potency. From a defender's perspective, we discover that LLaMA2's original system prompt universally enhances safety across models, and that safety prompts optimized through prompt tuning for a specific model also enhance safety for all models inside the safety basin.
Figure 4: When evaluating the safety landscape using jailbreaking queries, we find that these queries are highly sensitive to perturbations in model weights.
Figure 5: LLaMA Guard 2 evaluation also shows a basin shape similar to the safety keyword detection.
...and 3 more figures
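The 1D interpolation described in the Figure 2(a) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: models are represented as plain dictionaries mapping parameter names to lists of floats, and `interpolate_weights`, `scan_1d`, and `toy_score` are hypothetical names. In practice the score function would load the interpolated weights into a model, generate responses under a fixed chat format, and compute a safety score.

```python
def interpolate_weights(base, chat, alpha):
    """Return (1 - alpha) * base + alpha * chat for every named parameter."""
    if base.keys() != chat.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * b + alpha * c for b, c in zip(base[name], chat[name])]
        for name in base
    }


def scan_1d(base, chat, score_fn, alphas):
    """Evaluate score_fn at each interpolation coefficient along the base -> chat line."""
    return [(alpha, score_fn(interpolate_weights(base, chat, alpha))) for alpha in alphas]


# Toy stand-ins for two checkpoints that share an architecture.
base = {"w": [0.0, 0.0]}
chat = {"w": [1.0, 2.0]}


def toy_score(weights):
    """Placeholder score; a real scan would run a safety evaluation here."""
    return sum(weights["w"])


for alpha, value in scan_1d(base, chat, toy_score, [0.0, 0.5, 1.0]):
    print(f"alpha={alpha:.1f} score={value:.2f}")
```

With `alpha = 0` the sketch returns the base weights and with `alpha = 1` the chat weights; values outside [0, 1] extrapolate along the same line.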
