The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model

Brenden Smith, Dallin Baker, Clayton Chase, Myles Barney, Kaden Parker, Makenna Allred, Peter Hu, Alex Evans, Nancy Fulda

TL;DR

This work introduces the Injectable Realignment Model (IRM), a compact neural network that modulates a frozen Llama-2-7B-chat model by injecting activation perturbations during inference, as a tool for studying interpretability and alignment. Trained on three emotion-based alignment datasets derived from SQuAD2.0 prompts, the IRM induces observable alignment behaviors, which makes it possible to analyze the host model's activations. A striking finding is the vertical continuity of alignment signals across transformer blocks, dominated by a single neuron index (1512) whose influence persists across runs and prompts, likely due to the residual architecture. These results highlight a potential weak point in Llama-2 and motivate re-examining the role of the language modeling head, while demonstrating the IRM's value as a tool for interpretability and targeted alignment research.

Abstract

Large Language Models (LLMs) have an unrivaled and invaluable ability to "align" their output to a diverse range of human preferences, by mirroring them in the text they generate. The internal characteristics of such models, however, remain largely opaque. This work presents the Injectable Realignment Model (IRM) as a novel approach to language model interpretability and explainability. Inspired by earlier work on Neural Programming Interfaces, we construct and train a small network -- the IRM -- to induce emotion-based alignments within a 7B parameter LLM architecture. The IRM outputs are injected via layerwise addition at various points during the LLM's forward pass, thus modulating its behavior without changing the weights of the original model. This isolates the alignment behavior from the complex mechanisms of the transformer model. Analysis of the trained IRM's outputs reveals a curious pattern. Across more than 24 training runs and multiple alignment datasets, patterns of IRM activations align themselves in striations associated with a neuron's index within each transformer layer, rather than being associated with the layers themselves. Further, a single neuron index (1512) is strongly correlated with all tested alignments. This result, although initially counterintuitive, is directly attributable to design choices present within almost all commercially available transformer architectures, and highlights a potential weak point in Meta's pretrained Llama 2 models. It also demonstrates the value of the IRM architecture for language model analysis and interpretability. Our code and datasets are available at https://github.com/DRAGNLabs/injectable-alignment-model
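
The injection mechanism described in the abstract (layerwise addition into a frozen host model) can be illustrated with a toy residual stack. This is a minimal sketch under stated assumptions, not the authors' implementation: the "blocks" are small fixed random linear maps standing in for transformer blocks, the injection vectors are hand-written rather than produced by a trained IRM, and the names `make_block`, `forward`, `DIM`, `N_BLOCKS`, and `TARGET` are hypothetical.

```python
import random

def make_block(dim, rng, scale=0.05):
    """Build a frozen toy block: a fixed random linear map (stand-in for attention + MLP)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return block

def forward(x, blocks, injections=None):
    """Run the residual stack; optionally add one injection vector per block."""
    h = list(x)
    for i, block in enumerate(blocks):
        update = block(h)
        if injections is not None:
            # Layerwise addition: the injected vector is summed with the block's
            # activations. The block's own weights are never modified.
            update = [u + d for u, d in zip(update, injections[i])]
        # Residual connection: the update is added to the running hidden state.
        h = [a + u for a, u in zip(h, update)]
    return h

rng = random.Random(0)
DIM, N_BLOCKS, TARGET = 16, 4, 5  # toy sizes; TARGET is an arbitrary neuron index
blocks = [make_block(DIM, rng) for _ in range(N_BLOCKS)]
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Hypothetical injections: perturb the same neuron index in every block.
injections = [[0.0] * DIM for _ in range(N_BLOCKS)]
for inj in injections:
    inj[TARGET] = 1.0

baseline = forward(x, blocks)
steered = forward(x, blocks, injections)
shift = [s - b for s, b in zip(steered, baseline)]
largest = max(range(DIM), key=lambda j: abs(shift[j]))
print("index with the largest shift:", largest)
```

In this toy setting the perturbation at a fixed index accumulates through the residual additions, so the final hidden state shifts most at that same index. That is only an illustration of why a per-index signal can persist across blocks in a residual architecture; it makes no claim about the actual behavior of Llama 2.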

Paper Structure (20 sections, 2 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: The Llama-2 transformer architecture with accompanying IRM integration. The IRM receives as input the initial post-attention activations of the Llama-2 model, and produces as output a set of perturbations to be summed with the post-attention activations of each transformer block. The perturbations learned by the IRM provide valuable insights into its host model.
  • Figure 2: Top: IRM outputs for the anger dataset, trained with a random seed of 42, averaged across the input prompt tokens and first 10 generated tokens. Bottom: IRM outputs for the sadness dataset, trained with a random seed of 420, averaged across the input prompt tokens and first 10 generated tokens. The distinct vertical line at neuron index 1512 is repeatedly visible across twenty-four independent training runs, five datasets, two random seeds, and seven prompts. Further heat maps can be viewed in the Appendix.
  • Figure 3: Histograms showing at what points during inference the highest magnitude outputs were produced by IRMs of different sentiments. The histograms show that most of the 1000 largest values occur towards the beginning of inference, with very few past the midpoint.
  • Figure 4: A collection of four heat maps, representing the three training datasets and one untrained IRM, which serves as the baseline. The heat maps highlight which indices host the largest IRM outputs. The color scale differs from one heat map to the next so that smaller values are not overshadowed.
  • ...and 4 more figures
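
The heat-map analysis described in the captions above (averaging IRM outputs across tokens, then looking for a vertical stripe at one neuron index) can be sketched as follows. This is a hedged illustration on synthetic data, not the authors' analysis code: the array sizes, the planted index `PLANTED`, and the helper names `mean_over_tokens` and `column_scores` are all assumptions made for the example.

```python
import random

def mean_over_tokens(outputs):
    """Average a [tokens][layers][hidden] nested list over the token axis."""
    n_tokens = len(outputs)
    n_layers = len(outputs[0])
    hidden = len(outputs[0][0])
    return [
        [sum(outputs[t][l][j] for t in range(n_tokens)) / n_tokens for j in range(hidden)]
        for l in range(n_layers)
    ]

def column_scores(heat_map):
    """Mean absolute value per neuron index across layers (one score per column)."""
    n_layers = len(heat_map)
    return [
        sum(abs(row[j]) for row in heat_map) / n_layers
        for j in range(len(heat_map[0]))
    ]

rng = random.Random(42)
N_TOKENS, N_LAYERS, HIDDEN, PLANTED = 12, 8, 64, 37  # toy sizes, not Llama 2's

# Synthetic stand-in for recorded IRM outputs: small Gaussian noise everywhere.
outputs = [
    [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN)] for _ in range(N_LAYERS)]
    for _ in range(N_TOKENS)
]
# Plant a synthetic vertical stripe: the same index is large in every layer.
for t in range(N_TOKENS):
    for l in range(N_LAYERS):
        outputs[t][l][PLANTED] += 2.0

heat_map = mean_over_tokens(outputs)   # rows = layers, columns = neuron indices
scores = column_scores(heat_map)
dominant = max(range(HIDDEN), key=lambda j: scores[j])
print("dominant neuron index:", dominant)
```

Scoring columns rather than rows matches the striation pattern the captions describe, where the signal is tied to a neuron's index within each layer rather than to a particular layer. On this synthetic input the planted index is recovered by construction.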