Table of Contents
Fetching ...

NeST: Neuron Selective Tuning for LLM Safety

Sasha Behrouzi, Lichao Wu, Mohamadreza Rostami, Ahmad-Reza Sadeghi

TL;DR

NeST is presented, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead.

Abstract

Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines: full fine-tuning, LoRA-based fine-tuning, and circuit breakers across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety performance for alignment.

NeST: Neuron Selective Tuning for LLM Safety

TL;DR

NeST is presented, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead.

Abstract

Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal behavior by selectively adapting a small subset of safety-relevant neurons while freezing the remainder of the model. NeST aligns parameter updates with the internal organization of safety behavior by clustering functionally coherent safety neurons and enforcing shared updates within each cluster, enabling targeted and stable safety adaptation without broad model modification or inference-time overhead. We benchmark NeST against three dominant baselines: full fine-tuning, LoRA-based fine-tuning, and circuit breakers across 10 open-weight LLMs spanning multiple model families and sizes. Across all evaluated models, NeST reduces the attack success rate from an average of 44.5% to 4.36%, corresponding to a 90.2% reduction in unsafe generations, while requiring only 0.44 million trainable parameters on average. This amounts to a 17,310x decrease in updated parameters compared to full fine-tuning and a 9.25x reduction relative to LoRA, while consistently achieving stronger safety performance for alignment.
Paper Structure (30 sections, 15 equations, 5 figures, 6 tables, 1 algorithm)

This paper contains 30 sections, 15 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of the NeST in safety neurons detection and clustering steps.
  • Figure 2: An overview of the NeST in neuron-selective tuning step.
  • Figure 3: PCA visualization of clustered safety neurons in a representative mid-layer (Layer 6) of LLaMA-3.2-1B-Instruct. Each point denotes a safety neuron. Clusters in both FNN projections reveal coherent neuron groups that motivate NeST’s cluster-level update tying
  • Figure 4: Gradient alignment of safety neurons across layers. Within-cluster pairs (blue) exhibit higher cosine similarity than between-cluster pairs (orange), indicating aligned update directions that justify NeST’s cluster-level parameter tying.
  • Figure 5: Utility evaluation before and after NeST fine-tuning across three reasoning benchmarks. NeST preserves performance across mathematical (GSM8K), commonsense (ARC), and broad-domain knowledge (MMLU) tasks.