Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen, Thao Minh Le, Truyen Tran

TL;DR

Fine-tuning large language models often causes safety alignment to drift. The authors observe that safety-related regions of the loss landscape remain structurally preserved across base and fine-tuned models, which enables a geometry-driven restoration. They propose curvature-aware alignment restoration, which combines influence functions with second-order optimization to increase loss on harmful inputs while preserving task performance. The method is implemented on LoRA-based PEFT, with approximate Hessian inversion via L-BFGS. Across multiple model families and adversarial settings, it substantially reduces harmful outputs without sacrificing utility, and even improves few-shot generalization and robustness. The work shows that leveraging loss-landscape geometry allows precise, scalable safety restoration for efficiently fine-tuned open-source LLMs.

Abstract

Fine-tuning Large Language Models (LLMs) for downstream tasks often compromises safety alignment, even when using parameter-efficient methods like LoRA. In this work, we uncover a notable property: fine-tuned models preserve the geometric structure of their loss landscapes concerning harmful content, regardless of the fine-tuning method employed. This suggests that safety behaviors are not erased but shifted to less influential regions of the parameter space. Building on this insight, we propose a curvature-aware alignment restoration method that leverages influence functions and second-order optimization to selectively increase loss on harmful inputs while preserving task performance. By navigating the shared geometry between base and fine-tuned models, our method discourages unsafe outputs while preserving task-relevant performance, avoiding full reversion and enabling precise, low-impact updates. Extensive evaluations across multiple model families and adversarial settings show that our approach efficiently reduces harmful responses while maintaining or even improving utility and few-shot learning performance.
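
The core numerical ingredient named above, an approximate inverse-Hessian-vector product, can be illustrated with the standard L-BFGS two-loop recursion. The sketch below is a toy NumPy illustration, not the authors' implementation: the function names `lbfgs_inverse_hvp` and `restoration_step`, the step size `eps`, and the update rule (moving along the curvature-scaled gradient of a harmful-input loss to raise that loss) are assumptions made for this example, and the curvature pairs come from a small diagonal quadratic rather than from a LoRA-adapted model.

```python
import numpy as np

def lbfgs_inverse_hvp(grad, s_list, y_list):
    """Approximate H^{-1} @ grad with the L-BFGS two-loop recursion.

    s_list / y_list hold parameter and gradient differences (oldest first).
    """
    q = np.asarray(grad, dtype=float).copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: walk the curvature pairs from newest to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Scale by gamma = s^T y / y^T y from the most recent pair (initial H_0).
    s_last, y_last = s_list[-1], y_list[-1]
    r = (np.dot(s_last, y_last) / np.dot(y_last, y_last)) * q
    # Second loop: walk the pairs from oldest to newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

def restoration_step(theta, harmful_grad, s_list, y_list, eps=0.1):
    """Hypothetical update: ascend the harmful loss along H^{-1} g_harm."""
    return theta + eps * lbfgs_inverse_hvp(harmful_grad, s_list, y_list)

# Toy check on a quadratic with Hessian diag(2, 5): y = H @ s for each pair.
hessian = np.diag([2.0, 5.0])
s_pairs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_pairs = [hessian @ s for s in s_pairs]
g_harm = np.array([4.0, 10.0])  # stand-in for a harmful-loss gradient

direction = lbfgs_inverse_hvp(g_harm, s_pairs, y_pairs)
theta_new = restoration_step(np.zeros(2), g_harm, s_pairs, y_pairs, eps=0.5)
```

In this toy case the curvature pairs lie along the eigenvectors of the Hessian, so the recursion recovers the exact product; in a real model it would only approximate it, and the quality would depend on how the pairs are collected.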

Paper Structure

This paper contains 48 sections, 11 equations, 10 figures, 5 tables, 2 algorithms.

Figures (10)

  • Figure 1: Average loss comparison across base and fine-tuned Llama-3 8B Instruct models for three datasets: Dolly (task-specific), Alpaca (general), and HEx-PHI (harmful). Harmful content exhibits higher loss than benign content in both model states, showing that it lies in a distinct and preserved region of the loss landscape.
  • Figure 2: 3D loss landscape visualization for Llama-3 8B Instruct using gradient-informed direction projection (Section 2.2). The top row shows the loss landscape for harmful content (HEx-PHI), while the bottom row shows it for general data (Alpaca). Comparing the base (left) and fine-tuned (middle) models reveals preserved topological features for harmful content (structural difference: 1.46%), while general data landscapes undergo substantial transformation (structural difference: 20.37%). These quantitative measures of landscape change confirm that safety-relevant regions remain largely undisturbed during task-specific fine-tuning, providing direct evidence for our hypothesis of preserved safety mechanisms.
  • Figure 3: Attack success rates (the lower the better) for prefilling attacks across different alignment restoration methods on Llama-3.1 8B evaluated on AdvBench. Our curvature-aware approach achieves 63.0% ASR, significantly outperforming baseline LoRA (78.4%) and other safety methods, while approaching the robustness of the base model (47.4%).
  • Figure 4: Safety landscape visualization showing Attack Success Rate (ASR) across parameter perturbations for different methods on Qwen 2.5 7B. Our approach maintains a significantly wider and deeper safety basin, with near 0% ASR at the origin and slower degradation with distance.
  • Figure 5: 3D loss landscape visualization for Llama-3 8B with LoRA fine-tuning using gradient-informed direction projection. Top row: harmful content (HEx-PHI); bottom row: general data (Alpaca). LoRA fine-tuning preserves the loss landscape structure for harmful content (12.79% structural difference) while substantially altering general data landscapes (71.98% structural difference), demonstrating that parameter-efficient methods similarly maintain safety-relevant geometric features.
  • ...and 5 more figures
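
The landscape figures above compare loss surfaces evaluated on a 2D slice of parameter space. The following is a minimal sketch of how such a slice and a percentage difference between two surfaces could be computed. Everything here is an assumption for illustration: the captions do not define the direction construction or the "structural difference" metric, so `gradient_informed_directions` (normalized gradient plus an orthogonalized random direction) and `relative_surface_difference` (relative norm of the mean-centered difference) are hypothetical stand-ins, and the loss functions are toy quadratics rather than model losses.

```python
import numpy as np

def gradient_informed_directions(grad, rng):
    """Hypothetical choice: d1 follows the gradient, d2 is random and orthogonal to d1."""
    d1 = grad / np.linalg.norm(grad)
    r = rng.standard_normal(grad.shape)
    r -= np.dot(r, d1) * d1  # remove the component along d1
    d2 = r / np.linalg.norm(r)
    return d1, d2

def loss_surface(loss_fn, theta, d1, d2, radius=1.0, steps=11):
    """Evaluate loss_fn on a (steps x steps) grid around theta spanned by d1 and d2."""
    coords = np.linspace(-radius, radius, steps)
    surface = np.empty((steps, steps))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            surface[i, j] = loss_fn(theta + a * d1 + b * d2)
    return surface

def relative_surface_difference(surf_a, surf_b):
    """Hypothetical metric: percent difference between mean-centered surfaces."""
    a = surf_a - surf_a.mean()
    b = surf_b - surf_b.mean()
    return 100.0 * np.linalg.norm(a - b) / np.linalg.norm(a)

# Toy stand-ins for a "base" and a "fine-tuned" loss on the same inputs.
def base_loss(t):
    return 0.5 * np.dot(t, t)

def tuned_loss(t):
    return 0.5 * np.dot(t, t) + 0.3  # same shape, shifted by a constant

rng = np.random.default_rng(0)
theta0 = np.array([1.0, 2.0, 3.0])
d1, d2 = gradient_informed_directions(theta0, rng)  # gradient of base_loss at theta0 is theta0
surf_base = loss_surface(base_loss, theta0, d1, d2)
surf_tuned = loss_surface(tuned_loss, theta0, d1, d2)
diff_pct = relative_surface_difference(surf_base, surf_tuned)
```

Mean-centering makes this stand-in metric insensitive to a constant loss offset, so it responds only to changes in the surface's shape; whether the paper's measure behaves the same way is not stated in the captions.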