The Achilles' Heel of LLMs: How Altering a Handful of Neurons Can Cripple Language Abilities

Zixuan Qin, Kunlin Lyu, Qingchen Yu, Yifan Sun, Zhaoxin Fan

TL;DR

This paper proposes a Perturbation-based Causal Identification of Critical Neurons method to systematically locate critical neurons in LLMs, and reports three key insights that can guide the development of more robust model architectures and improve deployment security in safety-critical applications.

Abstract

Large Language Models (LLMs) have become foundational tools in natural language processing, powering a wide range of applications and research. Many studies have shown that LLMs share significant similarities with the human brain. Recent neuroscience research has found that a small subset of biological neurons in the human brain is crucial for core cognitive functions, which raises a fundamental question: do LLMs also contain a small subset of critical neurons? In this paper, we investigate this question by proposing a Perturbation-based Causal Identification of Critical Neurons method to systematically locate such critical neurons in LLMs. Our findings reveal three key insights: (1) LLMs contain ultra-sparse critical neuron sets. Disrupting these critical neurons can cause a 72B-parameter model with over 1.1 billion neurons to collapse completely, with perplexity increasing by up to 20 orders of magnitude; (2) These critical neurons are not uniformly distributed, but tend to concentrate in the outer layers, particularly within the MLP down_proj components; (3) Performance degradation exhibits sharp phase transitions, rather than a gradual decline, when these critical neurons are disrupted. Through comprehensive experiments across diverse model architectures and scales, we provide deeper analysis of these phenomena and their implications for LLM robustness and interpretability. These findings can offer guidance for developing more robust model architectures and improving deployment security in safety-critical applications.
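
To make the "orders of magnitude" claim concrete, the following is a minimal sketch of how perplexity and its increase could be computed from per-token log-probabilities. It uses the standard definition of perplexity (the exponential of the mean negative log-likelihood); the function names `perplexity` and `orders_of_magnitude` are illustrative and are not taken from the paper, and the paper's exact evaluation pipeline may differ.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def orders_of_magnitude(ppl_before, ppl_after):
    """Base-10 orders of magnitude by which perplexity grew (hypothetical helper)."""
    return math.log10(ppl_after / ppl_before)


# Toy numbers, not results from the paper: a model that assigns
# probability 0.5 to every token has a perplexity of about 2.
healthy = perplexity([math.log(0.5)] * 4)
print(healthy)
```

Under this definition, an increase of 20 orders of magnitude means the post-masking perplexity is roughly 10^20 times the baseline value.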

Paper Structure

This paper contains 26 sections, 12 equations, 7 figures, 9 tables, 2 algorithms.

Figures (7)

  • Figure 1: Illustration of critical neuron identification and progressive masking effects, using DeepSeek-R1-Distill-Llama-70B as an example. The top panel shows that our method identifies 3 critical neurons located in MLP down_proj components of the transformer architecture. The bottom panel demonstrates the progressive degradation of model performance on the question "What is the capital of France?" through sequential masking: the left chart shows token probabilities after masking the first critical neuron, the middle chart shows the effect of masking the first two critical neurons, and the right chart reveals catastrophic failure after masking all three critical neurons. The progression illustrates a sharp phase transition where masking the complete critical neuron set triggers sudden collapse rather than gradual degradation.
  • Figure 2: Phase transitions and architectural distribution of critical neurons across six representative models. The figure is organized in a 2×3 grid showing different models: top row displays Gemma-2B, Llama-3-8B-Instruct, and Phi-3-mini-4k-Instruct; bottom row shows Qwen2.5-72B-Instruct, Llama-3.3-70B-Instruct, and DeepSeek-R1-Distill-Llama-70B. Each subplot employs a dual-axis design with two distinct visualizations. For the layer distribution analysis (horizontal bars), the x-axis represents layer numbers (ranging from 0 to the maximum number of layers for each model), while the y-axis represents the count of critical neurons found at each layer. Blue bars indicate critical neurons located in MLP down_proj components, while orange bars represent critical neurons in other architectural components. For the phase transition analysis, the x-axis indicates the number of progressively masked neurons (0 to approximately 10-15 depending on the model), while the right y-axis shows perplexity values. The red curve with circle markers traces the evolution of perplexity as neurons are cumulatively masked in order of importance. Vertical dashed lines mark the critical threshold where sudden performance collapse occurs.
  • Figure 3: Comparison of neuron location strategies across three 70B-scale models on WikiText-103 (1,000 samples). The x-axis shows the number of masked neurons and the y-axis shows perplexity. Left: Llama-3.3-70B-Instruct, Center: Qwen2.5-72B-Instruct, Right: DeepSeek-R1-Distill-Llama-70B. Random masking (averaged over 10 trials) shows a gradual perplexity increase. Activation magnitude ranking (AM) is moderately effective, with steeper increases. Gradient magnitude ranking (GM) shows limited degradation that plateaus quickly. "Our method" marks the catastrophic perplexity reached by masking the minimal set of critical neurons identified by our approach.
  • Figure 4: Parameter sensitivity analysis for Llama-3.3-70B-Instruct. Left subplot: x-axis represents noise scale $\alpha$ (0 to 8), y-axis represents number of critical neurons (0 to 50), blue line shows the relationship with fixed $K=100$. Right subplot: x-axis represents sample size $K$ (0 to 125), y-axis represents number of critical neurons (0 to 20), orange line shows the relationship with fixed $\alpha=5.0$. Red dashed horizontal lines in both subplots mark the value of 7 neurons.
  • Figure 5: Supplementary phase transitions and architectural distribution of critical neurons across nine additional models. The figure is organized in a 3×3 grid showing different models. Each subplot employs a dual-axis design with two distinct visualizations. For the layer distribution analysis, the x-axis represents layer numbers, while the y-axis represents the count of critical neurons found at each layer. Blue bars indicate critical neurons located in MLP down_proj components, while orange bars represent critical neurons in other architectural components. For the phase transition analysis, the x-axis indicates the number of progressively masked neurons, while the right y-axis shows perplexity values. The red curve with circle markers traces the evolution of perplexity as neurons are cumulatively masked in order of importance. Vertical dashed lines mark the critical threshold where sudden performance collapse occurs.
  • ...and 2 more figures
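
The phase-transition curves described in the captions above plot perplexity against the number of cumulatively masked neurons, with a dashed line at the point of sudden collapse. The sketch below shows one way such a threshold could be read off a perplexity curve. The criterion used here (the first step that raises perplexity by at least one order of magnitude) is an assumption made for illustration, not the paper's definition, and `collapse_threshold` and `min_jump_orders` are hypothetical names.

```python
import math


def collapse_threshold(ppl_curve, min_jump_orders=1.0):
    """Return the smallest k whose masking step triggers a sharp jump.

    ppl_curve[k] is the perplexity after masking the first k neurons,
    so ppl_curve[0] is the unmasked baseline. A step counts as a collapse
    when it raises perplexity by at least `min_jump_orders` base-10 orders
    of magnitude (an assumed criterion). Returns None if no step qualifies.
    """
    for k in range(1, len(ppl_curve)):
        jump = math.log10(ppl_curve[k] / ppl_curve[k - 1])
        if jump >= min_jump_orders:
            return k
    return None


# Toy curves, not measurements from the paper.
sharp = [8.0, 8.5, 9.1, 4.0e6, 5.0e6]   # sudden collapse at the third neuron
gradual = [8.0, 10.0, 13.0, 17.0]       # slow drift, no single sharp step

print(collapse_threshold(sharp))
print(collapse_threshold(gradual))
```

Working in log space keeps the criterion scale-free, so the same threshold can be applied to models whose baseline perplexities differ.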