Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

TL;DR

This work tackles safety degradation that occurs when LLMs are fine-tuned for downstream tasks by introducing MSCP, a training-free defense that aligns safety activations across multiple levels of representation. MSCP identifies sparse safety-critical neurons by analyzing mid-deep Transformer layers and then applies composable safety-direction projections to LoRA parameters, drastically reducing harmful outputs while preserving task performance. It further demonstrates continual adaptability by generalizing across evolving safety dimensions and reducing the need for new parameter edits over time, achieving an average harmfulness around $1.15$ with minimal updates. The approach offers scalable, maintenance-free safety improvements for fine-tuned LLMs, with practical impact for safer deployment in dynamic, multi-dimensional safety contexts.

Abstract

While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization capability against unforeseen emerging safety concerns.
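To make the idea of a sparse "safety-direction projection" concrete, the sketch below shows one generic reading of it: an orthogonal projection of a weight update onto the line spanned by a direction vector, written back only for a chosen subset of output neurons. This is an illustration under stated assumptions, not the paper's formulation. The function name `project_onto_direction`, the dense list-of-lists layout for the update, and the toy values are all hypothetical; the abstract does not specify how the safety direction is estimated or how the projection is parameterized.

```python
def project_onto_direction(delta_w, direction, neuron_idx):
    """Project each column of delta_w onto span(direction).

    Only the rows listed in neuron_idx are overwritten with the projected
    values; all other rows keep their original entries. (Hypothetical
    helper for illustration; not taken from the paper.)
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")

    n_rows, n_cols = len(delta_w), len(delta_w[0])
    edited = [row[:] for row in delta_w]  # copy so the input is untouched

    for j in range(n_cols):
        # Scalar coefficient of column j along the direction vector.
        coeff = sum(direction[i] * delta_w[i][j] for i in range(n_rows)) / norm_sq
        for i in neuron_idx:
            edited[i][j] = coeff * direction[i]
    return edited


# Toy usage: a 2x2 update, an assumed direction, and one selected neuron.
delta_w = [[2.0, 0.0], [0.0, 2.0]]
direction = [1.0, 1.0]
sparse_edit = project_onto_direction(delta_w, direction, neuron_idx=[0])
full_edit = project_onto_direction(delta_w, direction, neuron_idx=[0, 1])
```

Restricting the write-back to `neuron_idx` is what keeps the perturbation small in this sketch: rows outside the selected set are returned unchanged, which mirrors the abstract's emphasis on minimal parameter modifications.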

Paper Structure

This paper contains 19 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The multi-level safety continual projection for fine-tuned LLMs framework. Our method consists of three main stages: downstream task fine-tuning, multi-scale safety activation localization, and training-free continual projection.
  • Figure 2: Visualization of Safety Activations in Multi-Scale Layers and Neurons. (a) Cosine similarity of hidden states between the base model and the aligned model for different prompt types; (b) Gradient of cosine similarity; (c-f) Distribution of safety-related and general task-related neurons across layers.
  • Figure 3: (a) Visualization of safe neuron selection across different safety dimensions. (b) and (c): In both LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct, selecting different layers as safety layers affects the LLM harmfulness scores and the keyword-based ASR.
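The quantities named in the Figure 2 caption, a per-layer cosine similarity between the hidden states of two models and the gradient of that curve, can be sketched as follows. This is a minimal illustration with toy vectors, assuming one hidden-state vector per layer and a simple finite difference as the "gradient"; the helper names and the rule of picking the layer with the steepest drop are assumptions, not details confirmed by the summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_similarity(base_states, aligned_states):
    """One similarity value per layer, comparing the two models' hidden states."""
    return [cosine(b, a) for b, a in zip(base_states, aligned_states)]


def finite_difference(values):
    """Discrete gradient: change in similarity from each layer to the next."""
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]


# Toy hidden states for three layers (hypothetical values).
base_states = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
aligned_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

sims = layerwise_similarity(base_states, aligned_states)
grads = finite_difference(sims)

# Assumed selection rule: the transition where similarity falls fastest.
steepest = min(range(len(grads)), key=lambda i: grads[i])
```

In this toy setup the representations diverge most sharply between the second and third layers, so `steepest` points at that transition. With real models, the hidden states would come from forward passes on the different prompt types mentioned in the caption.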