Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head

Harry J Davies

TL;DR

This work addresses the problem of interpreting large language models by proposing a lightweight method that decodes up-projection neuron weights directly through the LM-head to produce token-probability vectors. By mapping each neuron to its token probabilities and validating the mapping with activation clamping, the author identifies specialised, monosemantic neurons (e.g., for 'dog' and 'California') in Llama 3.1 8B and demonstrates that clamping these neurons can steer outputs toward the associated concepts. The decoding maps all up-projection features in under 10 seconds with minimal compute; about 75.4% of up-projection neurons keep the same top token in the Instruct model as in the pre-trained model; and clamping the 'dog' neuron leads the Instruct model to discuss dogs when asked about its favourite animal. Overall, this provides a scalable, low-cost interpretability tool that could complement more complex methods and guide targeted interventions such as knowledge editing or feature supervision in large models.

Abstract

Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. In this work, we demonstrate that it is possible to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). We illustrate this in Llama 3.1 8B, where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron, and we validate these by clamping the neurons to affect the probability of the concept in the output. We evaluate this method on both the pre-trained and Instruct models, finding that over 75% of neurons in the up-projection layers of the Instruct model have the same top associated token as in the pre-trained model. Finally, we demonstrate that clamping the "dog" neuron leads the Instruct model to always discuss dogs when asked about its favourite animal. Through our method, it is possible to map the top features of the entirety of Llama 3.1 8B's up-projection neurons in less than 10 seconds, with minimal compute.
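
The decoding step described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's implementation: the matrices are tiny hand-written stand-ins rather than Llama 3.1 8B weights, the same toy LM-head is reused for both the "pre-trained" and "tuned" models, and no normalisation layer is applied before the LM-head. The function names (`decode_neuron`, `top_token_ids`, `top_token_agreement`) are hypothetical. In practice the same computation would be a single matrix product in a tensor library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_neuron(neuron_weights, lm_head):
    """Project one neuron's weight vector through the LM-head.

    neuron_weights: list of length hidden_size (one row of an up-projection matrix).
    lm_head: vocab_size rows, each of length hidden_size.
    Returns one probability per vocabulary token.
    """
    logits = [sum(h * w for h, w in zip(row, neuron_weights)) for row in lm_head]
    return softmax(logits)


def top_token_ids(up_proj, lm_head):
    """Index of the most probable token for every neuron in one layer."""
    ids = []
    for weights in up_proj:
        probs = decode_neuron(weights, lm_head)
        ids.append(max(range(len(probs)), key=probs.__getitem__))
    return ids


def top_token_agreement(ids_a, ids_b):
    """Fraction of neurons whose top token is identical in both models."""
    matches = sum(1 for a, b in zip(ids_a, ids_b) if a == b)
    return matches / len(ids_a)


# Toy stand-ins (assumptions): 4-token vocabulary, hidden size 3, 3 neurons.
vocab = [" dog", " California", " the", " cat"]
lm_head = [
    [1.0, 0.0, 0.0],  # " dog"
    [0.0, 1.0, 0.0],  # " California"
    [0.0, 0.0, 1.0],  # " the"
    [0.7, 0.0, 0.3],  # " cat"
]
up_proj_base = [
    [4.0, 0.0, 0.0],  # aligned with the " dog" direction
    [0.0, 4.0, 0.0],  # aligned with the " California" direction
    [0.0, 0.0, 4.0],  # aligned with the " the" direction
]
up_proj_tuned = [
    [3.5, 0.2, 0.0],  # small drift, same top token
    [0.0, 3.8, 0.1],  # small drift, same top token
    [2.0, 0.0, 1.0],  # larger drift, top token changes
]

base_ids = top_token_ids(up_proj_base, lm_head)
tuned_ids = top_token_ids(up_proj_tuned, lm_head)
agreement = top_token_agreement(base_ids, tuned_ids)
print([vocab[i] for i in base_ids], [vocab[i] for i in tuned_ids], agreement)
```

The agreement fraction here is the toy analogue of the "over 75%" figure reported above; with real weights each model would be decoded through its own LM-head.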

Paper Structure (9 sections, 4 figures)

This paper contains 9 sections, 4 figures.

Figures (4)

  • Figure 1: Methodology to interpret large language model weights by directly decoding them with the LM head.
  • Figure 2: Visualisation of the position of different specialised up-projection neurons in the pre-trained version of Llama 3.1 8B, for the tokens " dog" (top) and " California" (bottom). Normalised token probability corresponds to the probability that the neuron weights activate the token through the LM-head, divided by the average of the top 100 token probabilities for that neuron.
  • Figure 3: The next token prediction probabilities of pre-trained Llama 3.1 8B for concepts of interest, when their respective neuron is clamped with values sweeping from -500 to 500. Top) The probability of "dog" as the next token of "My favourite animal is" when the identified specialised "dog" feature neuron is clamped at different values ranging from -500 to 500. Bottom) The probability of "California" as the next token of "I live in" when the identified specialised "California" feature neuron is clamped at values ranging from -500 to 500.
  • Figure 4: Example responses from Llama 3.1 8B Instruct when asked "What is your favourite animal?". Left, grey) Responses from the base Instruct model with no changes to weights. Right, red) Responses from the Instruct model with the identified dog neuron (layer 26, neuron 1,442) permanently clamped with an output of 145.
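
The quantities in the captions for Figures 2 and 3 can be illustrated with a small sketch. It rests on several assumptions that the captions do not settle: the model is a single toy gated MLP block with a residual connection rather than Llama 3.1 8B, the clamp is applied to the neuron's up-projection output before gating, and the sweep covers -5 to 5 rather than -500 to 500. All names (`normalised_probability`, `next_token_probs`, the weight matrices) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalised_probability(probs, token_id, k=100):
    """Token probability divided by the mean of the neuron's top-k probabilities."""
    top_k = sorted(probs, reverse=True)[:k]
    return probs[token_id] / (sum(top_k) / len(top_k))


def silu(v):
    """SiLU activation, used here as the gate non-linearity."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def next_token_probs(x, w_gate, w_up, w_down, lm_head, clamp=None):
    """Toy gated MLP block with a residual connection, followed by the LM-head.

    clamp: optional (neuron_index, value) pair; when given, that neuron's
    up-projection output is replaced by the fixed value (an assumption about
    where the clamp is applied).
    """
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    if clamp is not None:
        index, value = clamp
        up[index] = value
    hidden = [g * u for g, u in zip(gate, up)]
    out = [xi + di for xi, di in zip(x, matvec(w_down, hidden))]
    return softmax(matvec(lm_head, out))


# Toy stand-ins (assumptions): vocabulary [" dog", " cat", " the"], hidden size 3,
# two MLP neurons. Neuron 0 writes into the " dog" direction of the residual stream.
lm_head = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_gate = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
w_up = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [0.0, 0.0, 1.0]  # context that initially favours " the"

baseline = next_token_probs(x, w_gate, w_up, w_down, lm_head)[0]
sweep = [-5.0, -2.5, 0.0, 2.5, 5.0]
dog_probs = [
    next_token_probs(x, w_gate, w_up, w_down, lm_head, clamp=(0, v))[0]
    for v in sweep
]
for value, prob in zip(sweep, dog_probs):
    print(f"clamp={value:+.1f}  p(' dog')={prob:.4f}")
```

In this toy setting the probability of " dog" rises monotonically with the clamp value because the gate is positive and the neuron's down-projection column points along the " dog" direction; whether a real neuron behaves this way is exactly what the clamping sweep in Figure 3 tests.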