Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han; Feifei Zhao; Dongcheng Zhao; Guobin Shen; Ping Wu; Yu Shi; Yi Zeng

TL;DR

This work tackles the safety degradation that occurs when LLMs are fine-tuned for downstream tasks by introducing MSCP, a training-free defense that aligns safety activations across multiple levels of representation. MSCP identifies sparse safety-critical neurons by analyzing mid-to-deep Transformer layers and then applies composable safety-direction projections to LoRA parameters, reducing harmful outputs while preserving task performance. It further demonstrates continual adaptability by generalizing across evolving safety dimensions and reducing the need for new parameter edits over time, achieving an average harmfulness score of around $1.15$ with minimal updates. The approach offers scalable, maintenance-free safety improvements for fine-tuned LLMs, with practical impact for safer deployment in dynamic, multi-dimensional safety contexts.

Abstract

While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization capability against unforeseen emerging safety concerns.
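
To make the idea of a sparse, composable projection concrete, the following toy sketch projects only a chosen subset of rows of a weight update onto the span of several direction vectors. It is an illustration under stated assumptions, not the MSCP implementation: the function names, the use of Gram-Schmidt to combine directions, the choice to project onto (rather than away from) the span, and the toy values for `delta_w`, `directions`, and the selected row index are all hypothetical and do not come from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: combine several direction vectors into one orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([x / norm for x in r])
    return basis


def project(row, basis):
    """Orthogonally project one weight row onto the span of the basis."""
    out = [0.0] * len(row)
    for b in basis:
        c = dot(row, b)
        out = [o + c * y for o, y in zip(out, b)]
    return out


def project_selected_rows(delta_w, neuron_idx, directions):
    """Project only the selected rows (hypothetical 'safety-critical neurons');
    all other rows of the update are left unchanged."""
    basis = orthonormalize(directions)
    selected = set(neuron_idx)
    return [
        project(row, basis) if i in selected else list(row)
        for i, row in enumerate(delta_w)
    ]


# Toy weight update (standing in for a LoRA product B @ A); values are made up.
delta_w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# Two hypothetical directions whose span is the plane of the first two axes.
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
edited = project_selected_rows(delta_w, [1], directions)
print(edited)
```

In this toy setting only the selected row changes, which mirrors the abstract's emphasis on minimal parameter perturbation. How the directions and the neuron set are actually obtained in MSCP is not specified here and is left outside the sketch.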
