Can LLM Safety Be Ensured by Constraining Parameter Regions?

Zongmin Li, Jian Su, Farah Benamara, Aixin Sun

TL;DR

A systematic evaluation of four safety region identification methods spanning different parameter granularities across four families of backbone LLMs with varying sizes suggests that current techniques fail to reliably identify a stable, dataset-agnostic safety region.

Abstract

Large language models (LLMs) are often assumed to contain "safety regions": parameter subsets whose modification directly influences safety behaviors. We conduct a systematic evaluation of four safety region identification methods spanning different parameter granularities, from individual weights to entire Transformer layers, across four families of backbone LLMs with varying sizes. Using ten safety identification datasets, we find that the identified safety regions exhibit only low to moderate overlap, as measured by Intersection over Union (IoU). The overlap drops significantly when the safety regions are further refined using utility datasets (i.e., non-harmful queries). These results suggest that current techniques fail to reliably identify a stable, dataset-agnostic safety region.
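The overlap measure named in the abstract can be illustrated with a minimal sketch. It assumes that a safety region is represented as a set of parameter identifiers (for example, neuron indices) and that refinement with a utility dataset amounts to removing parameters that also appear in a utility region. Both are assumptions made for illustration, as are the function names `iou` and `isolate` and the toy index sets; the paper's own representation and refinement procedure may differ.

```python
from typing import Hashable, Iterable, Set


def iou(region_a: Iterable[Hashable], region_b: Iterable[Hashable]) -> float:
    """Intersection over Union of two sets of parameter identifiers."""
    a, b = set(region_a), set(region_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty regions have zero overlap.
        return 0.0
    return len(a & b) / len(union)


def isolate(safety_region: Iterable[Hashable],
            utility_region: Iterable[Hashable]) -> Set[Hashable]:
    """Assumed refinement step: drop parameters also flagged as utility-relevant."""
    return set(safety_region) - set(utility_region)


# Hypothetical regions identified from two safety datasets, plus a utility region.
region_d0 = {1, 2, 3, 4, 5, 6}
region_d1 = {4, 5, 6, 7, 8, 9}
utility = {5, 6, 7}

raw_overlap = iou(region_d0, region_d1)
isolated_overlap = iou(isolate(region_d0, utility), isolate(region_d1, utility))

print(f"raw IoU: {raw_overlap:.3f}")
print(f"utility-isolated IoU: {isolated_overlap:.3f}")
```

With these toy sets the raw IoU is 3/9 and the utility-isolated IoU is 1/6. The numbers only show how the metric is computed and say nothing about the paper's measured values.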

Paper Structure (40 sections, 1 equation, 20 figures, 9 tables)

Figures (20)

  • Figure 1: Overview of current safety region identification methods
  • Figure 2: Utility-isolated safety region overlap analysis using SafeNeuron on Llama-3-8B-Instruct. (a) We begin with $\mathcal{D}_0$ and add one dataset at a time, in order from $\mathcal{D}_1$ to $\mathcal{D}_9$. We then isolate each identified safety region using the utility region identified by $\mathcal{D}_u$. (b) We begin with $\mathcal{D}_9$ and add one dataset at a time, in order from $\mathcal{D}_8$ to $\mathcal{D}_0$. We then isolate each identified safety region using the utility region identified by $\mathcal{D}_u$. (c) A symmetric matrix in which each element is the semantic cosine similarity between the centroid embeddings of two multi-category identification datasets. (d) A symmetric matrix in which each element is the pairwise Iso-Utility IoU between two utility-isolated safety regions.
  • Figure 3: Semantic similarity vs. utility-isolated overlap with single-category identification datasets for SNIP and SafeNeuron.
  • Figure 4: For each subfigure, the upper half shows the cosine similarity analysis results for "utility-utility (U-U) pairs" and "utility-harmful (U-H) pairs" at each hidden layer of the target LLM. The lower half shows the layer-wise average angular difference between these two cases for the target LLM. For $\mathcal{D}_i^{\mathrm{I}} (i\in [1,9])$, we show a single image per analysis because the results for each $\mathcal{D}_i^{\mathrm{I}}$ are similar.
  • Figure 5: GPT-4 Judge template for over-rejection evaluation
  • ...and 15 more figures