Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers
Haowen Pan, Yixin Cao, Xiaozhi Wang, Xun Yang, Meng Wang
TL;DR
This work tackles interpretability in multi-modal LLMs by identifying key vision-language neurons within Transformer FFNs using a gradient-free contribution score and linking neuron activations to cross-modal concepts. It introduces a targeted knowledge-editing scheme that modifies selected FFN weight rows to swap a source token for a target token without full-model retraining, enabling efficient, causal control over outputs. The authors propose four quantitative metrics—semantic sensitivity, region invariance, cross-image invariance, and specificity—and validate the approach across three vision-language models on a large image-caption dataset, revealing that multi-modal neurons tend to reside in higher layers and exhibit robust, concept-specific, and causally impactful behavior. The results offer a pathway to more explainable and controllable multi-modal LLMs with practical implications for reducing hallucinations and biases through targeted editing and interpretability analysis.
Abstract
Understanding the internal mechanisms by which multi-modal large language models (LLMs) interpret different modalities and integrate cross-modal representations is becoming increasingly critical for continued progress in both academia and industry. In this paper, we propose a novel method for identifying the key neurons that explain how multi-modal LLMs bridge visual and textual concepts for captioning. Our method improves on prior work in both efficiency and range of applicability by removing the need for costly gradient computation. Building on the identified neurons, we further design a multi-modal knowledge editing method that can help mitigate sensitive words or hallucination. We provide the theoretical assumptions underlying our design and, for empirical evaluation, conduct extensive quantitative and qualitative experiments. The results not only validate the effectiveness of our methods but also offer insightful findings that highlight three key properties of multi-modal neurons: sensitivity, specificity, and causal effect, shedding light on future research.
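The abstract does not spell out how a gradient-free neuron score is computed. As a rough illustration only, the sketch below assumes one common form of direct attribution: each FFN neuron's contribution to a token's logit is its activation multiplied by the projection of its output-weight row onto that token's unembedding vector. The function names (`neuron_contributions`, `top_k_neurons`), the weight layouts, the omission of the output bias, and the random toy data are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def neuron_contributions(activations, w_out, w_unembed, token_id):
    """Per-neuron contribution to one token's logit, using forward values only.

    Assumed (hypothetical) layouts:
      activations: (d_ff,)          FFN hidden activations at one position
      w_out:       (d_ff, d_model)  FFN output projection, row i belongs to neuron i
      w_unembed:   (vocab, d_model) unembedding matrix, row t belongs to token t
    """
    # How strongly each neuron's output direction points at the token's logit.
    direction_alignment = w_out @ w_unembed[token_id]  # shape (d_ff,)
    # Scale by how active each neuron is; no gradients are involved.
    return activations * direction_alignment

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, best first."""
    return np.argsort(scores)[::-1][:k]

# Toy data standing in for a real model's activations and weights.
rng = np.random.default_rng(0)
d_ff, d_model, vocab = 8, 4, 10
acts = rng.standard_normal(d_ff)
w_out = rng.standard_normal((d_ff, d_model))
w_unembed = rng.standard_normal((vocab, d_model))

scores = neuron_contributions(acts, w_out, w_unembed, token_id=3)
top = top_k_neurons(scores, k=3)
```

Under these assumptions the per-neuron scores sum to the logit that the FFN output as a whole contributes to the token (bias aside), which is what makes them readable as an additive decomposition. An edit in the spirit described here would then act only on the weight rows of the top-ranked neurons rather than on the full model.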
