Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
Nicholas Pochinkov, Ben Pasero, Skylar Shibayama
TL;DR
This work tackles mechanistic interpretability of transformer attention by systematically ablating attention-head neurons and introducing peak ablation, a method that centers activations at their modal value. By comparing peak ablation against zero ablation, mean ablation, and activation resampling across text and vision models, the authors show that peak ablation often minimizes performance degradation, though the best approach can depend on the model and regime. The study provides a unified framework for understanding how activation distributions influence pruning and causal-scrubbing analyses, and suggests peak-centered activation as a natural basis for future interpretability and sparsity research. These findings have practical implications for more robust pruning and more faithful interpretation of attention mechanisms in large-scale transformers.
Abstract
The use of transformer-based models is growing rapidly throughout society. With this growth, it is important to understand how they work, and in particular, how their attention mechanisms represent concepts. Many interpretability methods examine models through their neuron activations, yet these activations are themselves poorly understood. We describe different lenses through which to view neuron activations, and investigate their effectiveness in language models and vision transformers using four methods of neuron ablation: zero ablation, mean ablation, activation resampling, and a novel approach we term 'peak ablation'. Through experimental analysis, we find that in different regimes and models, each method can offer the lowest degradation of model performance compared to the other methods, with resampling usually causing the most significant performance deterioration. We make our code available at https://github.com/nickypro/investigating-ablation.
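
As a rough illustration of the four ablation methods named above, the following NumPy sketch replaces one neuron's activations across a batch of inputs. It is not taken from the linked repository. The function names `peak_value` and `ablate`, the histogram-based estimate of the modal value with its `bins` parameter, and the choice to implement resampling as a permutation of the neuron's activations across inputs are all assumptions made for illustration.

```python
import numpy as np


def peak_value(acts, bins=100):
    """Estimate a neuron's modal activation (assumed: centre of the fullest histogram bin)."""
    counts, edges = np.histogram(acts, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])


def ablate(acts, neuron, method, rng=None, bins=100):
    """Return a copy of `acts` ([n_inputs, n_neurons]) with one neuron ablated.

    method: "zero", "mean", "resample", or "peak".
    """
    out = np.array(acts, dtype=float, copy=True)
    col = out[:, neuron].copy()
    if method == "zero":
        # Zero ablation: set the activation to 0 for every input.
        out[:, neuron] = 0.0
    elif method == "mean":
        # Mean ablation: set the activation to its average over inputs.
        out[:, neuron] = col.mean()
    elif method == "resample":
        # Resampling (assumed form): give each input the activation
        # that the same neuron produced on a different, random input.
        if rng is None:
            rng = np.random.default_rng()
        out[:, neuron] = rng.permutation(col)
    elif method == "peak":
        # Peak ablation: set the activation to its estimated modal value.
        out[:, neuron] = peak_value(col, bins=bins)
    else:
        raise ValueError(f"unknown ablation method: {method}")
    return out
```

For a neuron whose activations are skewed by a few large outliers, the mean and the peak can differ substantially, and this is the case in which the choice between mean and peak ablation matters.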
