Safety Layers in Aligned Large Language Models: The Key to LLM Security

Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li

TL;DR

The paper identifies a small set of contiguous layers in the middle of aligned LLMs, called safety layers, that are essential for refusing malicious prompts. By analyzing layer-wise vector representations with cosine similarity, angular gaps, and the over-rejection phenomenon, the authors locate these layers and determine their boundaries across multiple models. They then propose Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the safety layers during fine-tuning to preserve security without sacrificing performance. Across normal, implicit, backdoor, and harmful-data fine-tuning scenarios, SPPFT consistently mitigates security degradation and reduces computational cost compared to full fine-tuning, advancing practical secure deployment of aligned LLMs.

Abstract

Aligned LLMs are secure, capable of recognizing and refusing to answer malicious questions. However, the role of internal parameters in maintaining such security is not yet well understood; moreover, these models can be vulnerable to security degradation when subjected to fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that the proposed approach can significantly preserve LLM security while maintaining performance and reducing computational resources compared to full fine-tuning.
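
The idea behind SPPFT, updating every layer except a contiguous block of safety layers, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the "model" is a hypothetical stack of scalar layers, the layer range 3 to 5 is an arbitrary stand-in for the located safety layers, and the task (fit a single target value) is invented for illustration. In a real deep-learning framework the same effect would typically be obtained by excluding the safety layers' parameters from gradient updates.

```python
from typing import List, Set


def forward(weights: List[float], x: float) -> float:
    """Toy 'model': each layer multiplies its input by one scalar weight."""
    for w in weights:
        x = w * x
    return x


def sppft_step(weights: List[float], frozen: Set[int],
               x: float, target: float, lr: float = 0.01) -> List[float]:
    """One gradient-descent step on squared error that skips frozen layers."""
    err = forward(weights, x) - target
    new_weights = list(weights)
    for i, w in enumerate(weights):
        if i in frozen:
            continue  # assumed safety layer: parameter is left untouched
        # d(output)/d(w_i) = x * product of all other layer weights
        others = 1.0
        for j, wj in enumerate(weights):
            if j != i:
                others *= wj
        new_weights[i] = w - lr * (2.0 * err * x * others)
    return new_weights


# Hypothetical 8-layer model; layers 3..5 (0-indexed) stand in for safety layers.
initial = [1.0, 0.9, 1.1, 1.2, 0.8, 1.05, 0.95, 1.0]
safety_layers = set(range(3, 6))

weights = list(initial)
for _ in range(200):
    weights = sppft_step(weights, safety_layers, x=1.0, target=2.0)

loss_before = (forward(initial, 1.0) - 2.0) ** 2
loss_after = (forward(weights, 1.0) - 2.0) ** 2
print("frozen layers unchanged:",
      all(weights[i] == initial[i] for i in safety_layers))
print("loss before:", loss_before, "loss after:", loss_after)
```

In this toy setting the remaining layers still absorb the fine-tuning objective while the frozen block keeps its original values, and no gradients are computed for the frozen block, which mirrors the reduced-computation claim only in a very loose sense.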

Paper Structure

This paper contains 41 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The layer-wise cosine similarity analysis in Phi-3-mini-4k-instruct when exposed to Normal-normal, Malicious-malicious, and Normal-malicious query pairs during inference. The shaded region represents the fluctuation range of the cosine similarity list $L_C$ under each analysis setting, which arises from randomly selecting query pairs with different semantics $r$ times. The solid lines show the per-layer curve obtained by averaging the $r$ sets of cosine similarity data. Statistical calculations were performed with the settings $P=100$, $Q=100$ and $r=500$.
  • Figure 2: The upper half shows the "Normal-Normal (N-N) Pairs" and "Normal-Malicious (N-M) Pairs" cosine similarity analysis results for each hidden layer of Llama-3-8B-Instruct, Llama-2-7B-Chat, Phi-3-mini-4k-instruct and gemma-2b-it. The lower half displays the mean angular difference between these two cases for each aligned LLM.
  • Figure 3: The "N-N Pair" and "N-M Pair" analysis for the internal layers of pre-trained LLMs.
  • Figure 4: Attention score heatmap of Llama-2-7b-chat and Phi-3-mini-4k-instruct. The vertical axis represents the layers, while the horizontal axis corresponds to the input tokens. The darkness of each cell indicates the attention score of a token within a specific layer, reflecting how much attention the layer allocates to that token. Black dashed lines mark the locations of the safety layers, dividing the layers into three distinct sections.
  • Figure 5: The mean cosine similarity of the final position vector for each layer in Llama-3-8B-Instruct when exposed to Normal-normal, Malicious-malicious, and Normal-malicious question pairs during inference. The shaded region represents the fluctuation range of the cosine similarity curve at each analysis setting, which arises from the random selection within each problem set. Statistical calculations were performed with the settings $P=100$, $Q=100$, and $r=500$.
  • ...and 7 more figures
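
The layer-wise analysis described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the paper's code: the hidden vectors are synthetic (the helper `toy_query` and its `split_layer` parameter are hypothetical), the settings $P=20$, $Q=20$, $r=200$ are smaller than those in the captions, and the angular gap is computed as the difference of the arccosines of the averaged curves, which is only one plausible reading of "mean angular difference".

```python
import math
import random
from typing import List

Vector = List[float]
Query = List[Vector]  # one final-position vector per layer


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_layer_cosine(set_a: List[Query], set_b: List[Query],
                      r: int, rng: random.Random) -> List[float]:
    """Average per-layer cosine similarity over r randomly drawn query pairs."""
    num_layers = len(set_a[0])
    totals = [0.0] * num_layers
    for _ in range(r):
        if set_a is set_b:
            a, b = rng.sample(set_a, 2)  # two distinct queries from one set
        else:
            a, b = rng.choice(set_a), rng.choice(set_b)
        for layer in range(num_layers):
            totals[layer] += cosine(a[layer], b[layer])
    return [t / r for t in totals]


def angle_deg(c: float) -> float:
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


def toy_query(malicious: bool, rng: random.Random, dim: int = 16,
              num_layers: int = 6, split_layer: int = 3) -> Query:
    """Synthetic stand-in for hidden states: malicious queries drift away
    from normal ones starting at split_layer (an invented assumption)."""
    vectors = []
    for layer in range(num_layers):
        vec = [1.0 + rng.gauss(0.0, 0.1) for _ in range(dim)]
        if malicious and layer >= split_layer:
            vec = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(vec)]
        vectors.append(vec)
    return vectors


rng = random.Random(0)
P, Q, r = 20, 20, 200
normal = [toy_query(False, rng) for _ in range(P)]
malicious = [toy_query(True, rng) for _ in range(Q)]

nn_curve = mean_layer_cosine(normal, normal, r, rng)
nm_curve = mean_layer_cosine(normal, malicious, r, rng)
angular_gap = [angle_deg(nm) - angle_deg(nn) for nn, nm in zip(nn_curve, nm_curve)]

for layer in range(len(angular_gap)):
    print(f"layer {layer}: N-N={nn_curve[layer]:.3f} "
          f"N-M={nm_curve[layer]:.3f} gap={angular_gap[layer]:.1f} deg")
```

With real models, `normal` and `malicious` would instead hold the final-position hidden vectors collected at each layer during inference; the layer at which the N-N and N-M curves start to separate is the kind of signal the figures use to indicate where safety layers begin.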