Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers

Haowen Pan, Yixin Cao, Xiaozhi Wang, Xun Yang, Meng Wang

TL;DR

This work tackles interpretability in multi-modal LLMs by identifying key vision-language neurons within Transformer FFNs using a gradient-free contribution score and linking neuron activations to cross-modal concepts. It introduces a targeted knowledge-editing scheme that modifies selected FFN weight rows to swap a source token for a target token without full-model retraining, enabling efficient, causal control over outputs. The authors propose four quantitative metrics—semantic sensitivity, region invariance, cross-image invariance, and specificity—and validate the approach across three vision-language models on a large image-caption dataset, revealing that multi-modal neurons tend to reside in higher layers and exhibit robust, concept-specific, and causally impactful behavior. The results offer a pathway to more explainable and controllable multi-modal LLMs with practical implications for reducing hallucinations and biases through targeted editing and interpretability analysis.
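The summary above does not spell out the contribution score, so the following is only a minimal sketch of one common gradient-free formulation: a neuron's score for a token is its activation times the projection of its FFN output row onto that token's unembedding vector. The function names, argument names, and the row-per-neuron layout of `w_out` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def contribution_scores(activations, w_out, w_unembed, token_id):
    """Per-neuron direct contribution to one token's logit (illustrative).

    activations: (d_ffn,) post-nonlinearity FFN activations at one position
    w_out:       (d_ffn, d_model) FFN output projection; row i is assumed
                 to be the vector neuron i writes to the residual stream
    w_unembed:   (vocab, d_model) unembedding matrix
    token_id:    index of the token whose logit is being explained
    """
    # Project every neuron's write vector onto the token's unembedding direction.
    per_neuron_logit = w_out @ w_unembed[token_id]  # shape (d_ffn,)
    # Scale by how strongly each neuron fired; no gradients are needed.
    return activations * per_neuron_logit

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, best first."""
    return np.argsort(scores)[::-1][:k]
```

Under these assumptions the scores sum to the FFN layer's direct contribution to the token's logit, which is why ranking by score needs only a forward pass.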

Abstract

Understanding the internal mechanisms by which multi-modal large language models (LLMs) interpret different modalities and integrate cross-modal representations is becoming increasingly critical for continued progress in both academia and industry. In this paper, we propose a novel method to identify key neurons for interpretability, that is, neurons that show how multi-modal LLMs bridge visual and textual concepts for captioning. Our method improves on prior work in efficiency and range of application by removing the need for costly gradient computation. Based on the identified neurons, we further design a multi-modal knowledge editing method that can help mitigate sensitive words or hallucination. We provide a theoretical assumption as the rationale for our design. For empirical evaluation, we conduct extensive quantitative and qualitative experiments. The results not only validate the effectiveness of our methods but also offer insightful findings that highlight three key properties of multi-modal neurons: sensitivity, specificity, and causal effect, which shed light on future research.

Paper Structure (31 sections, 8 equations, 7 figures, 12 tables, 1 algorithm)


Figures (7)

  • Figure 1: (i) Multi-modal neurons in FFN within multi-modal LLM. We develop a method to (a) identify multi-modal neurons and confirm that they can encode specific concepts from (b) images to (c) texts and (d) causally affect model output. (ii) Architecture of layer $l$ in Transformer-based LLM.
  • Figure 2: Distribution of unique multi-modal neurons per layer, for different numbers of top-scoring neurons (by contribution score) selected per image.
  • Figure 3: Ratios of invariant neurons among the top-$k$ neurons before and after shuffling. For each image, we record the mean ratio across concepts that appear in both the original caption and the caption generated from shuffled image patches, and then average across all images.
  • Figure 4: Ratios of the common neurons in top-100 neurons. We set $N=5$ and report results of some concepts that frequently appear in sampled images.
  • Figure 5: Heatmap of the scores (after normalization) of multi-modal neurons corresponding to specific concepts when encoding different contents in an example image. The x-axis represents concepts in the given image, and y-axis represents the top-1 neuron corresponding to each concept, respectively. Darker blocks indicate higher scores, which means higher relevance.
  • ...and 2 more figures
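
The ratios reported in Figures 3 and 4 are overlaps between top-$k$ neuron sets. The captions do not give the exact normalization, so the sketch below assumes the overlap is divided by the size of the reference top-$k$ set; both function names are hypothetical and chosen only for illustration.

```python
def invariant_ratio(top_before, top_after):
    """Fraction of the original top-k neurons that are still in the top-k
    after the image patches are shuffled (assumed normalization: |before|)."""
    before, after = set(top_before), set(top_after)
    if not before:
        return 0.0
    return len(before & after) / len(before)

def common_ratio(top_lists):
    """Fraction of top-k neurons shared by all N images for one concept.

    top_lists: one list of top-k neuron ids per image; the first list is
    used as the reference for normalization (an assumption).
    """
    if not top_lists or not top_lists[0]:
        return 0.0
    common = set(top_lists[0]).intersection(*top_lists[1:])
    return len(common) / len(set(top_lists[0]))
```

A neuron id here can be any hashable value, for example a `(layer, index)` tuple, so that neurons in different layers are not conflated.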