Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering

Nicholas Pochinkov, Ben Pasero, Skylar Shibayama

TL;DR

This work tackles mechanistic interpretability of transformer attention by systematically ablating attention-head neurons and introducing peak ablation, a method that centers activations at their modal value. By comparing peak ablation against zero, mean, and activation resampling across text and vision models, the authors show that peak ablation often minimizes performance degradation, though the best approach can depend on model and regime. The study provides a unified framework for understanding how activation distributions influence pruning and causal-scrubbing analyses, and suggests peak-centered activation as a natural basis for future interpretability and sparsity research. These findings have practical implications for more robust pruning and more faithful interpretation of attention mechanisms in large-scale transformers, potentially improving reliability in real-world deployments.

Abstract

The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how the attention mechanisms represent concepts. Many interpretability methods exist, but many of them look at models through their neuronal activations, which are poorly understood. We describe different lenses through which to view neuron activations, and investigate their effectiveness in language models and vision transformers using several methods of neural ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to the other methods, with resampling usually causing the most significant performance deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
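
The four ablation methods named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that lives in the linked repository and may differ). The function names `peak_value` and `ablate`, the histogram bin count, and the use of a permutation across samples to stand in for activation resampling are all assumptions made for illustration. The synthetic bimodal neuron mimics the kind of distribution shown in Figure 1, where the mean falls between two peaks rather than on either of them.

```python
import numpy as np

def peak_value(acts, bins=100):
    """Estimate the modal activation as the centre of the fullest histogram bin.

    The bin count is an illustrative assumption, not a value from the paper.
    """
    counts, edges = np.histogram(acts, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])

def ablate(acts, neuron_mask, method, rng=None, bins=100):
    """Return a copy of `acts` with the masked neurons ablated.

    acts:        float array of shape (n_samples, n_neurons)
    neuron_mask: boolean array of shape (n_neurons,), True = ablate
    method:      "zero", "mean", "peak" or "resample"
    """
    out = acts.copy()
    idx = np.flatnonzero(neuron_mask)
    if method == "zero":
        out[:, idx] = 0.0
    elif method == "mean":
        # Replace each neuron with its mean activation over the dataset.
        out[:, idx] = acts[:, idx].mean(axis=0)
    elif method == "peak":
        # Replace each neuron with its estimated modal (peak) activation.
        out[:, idx] = [peak_value(acts[:, j], bins) for j in idx]
    elif method == "resample":
        # Assumption: resampling is modelled by shuffling each neuron's
        # activations across samples, so every input sees another input's value.
        if rng is None:
            rng = np.random.default_rng()
        for j in idx:
            out[:, j] = rng.permutation(acts[:, j])
    else:
        raise ValueError(f"unknown ablation method: {method}")
    return out

# Synthetic data: neuron 0 is bimodal with neither peak at zero,
# neuron 1 is a plain zero-centred Gaussian.
rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(-2.0, 0.2, 7000), rng.normal(3.0, 0.2, 3000)])
acts = np.stack([rng.permutation(bimodal), rng.normal(0.0, 1.0, 10000)], axis=1)
mask = np.array([True, False])

for method in ("zero", "mean", "peak", "resample"):
    ablated = ablate(acts, mask, method, rng=rng)
    print(method, "-> first ablated values of neuron 0:", ablated[:3, 0])
```

For the bimodal neuron, zero and mean ablation both substitute a value the neuron rarely takes, while the peak estimate lands on its most common value. This is the intuition behind centering on the peak, though which method degrades a real model least is an empirical question.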

Paper Structure (16 sections, 3 figures, 3 tables)

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Un-normalised probability density functions (histograms) of attention neuron activations in RoBERTa. We see in (left) an average of distributions of all neurons in a layer, (centre) a bi-modal neuron with both peaks not at zero, and (right) another example of a neuron with an atypical distribution. X-axis shows neuron value, and Y-axis shows probability of a neuron taking that value.
  • Figure 2: Change in Top1 next-token prediction accuracy (Top1) and cross-entropy loss (CE Loss) at different fractions of model pruned with different methods of ablation for Mistral 7B and OPT 1.3B.
  • Figure 3: Change in Top1 next-token prediction accuracy (Top1) and cross-entropy loss (CE Loss) at different fractions of model pruned with different methods of ablation for ViT 7B and RoBERTa.