Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering

Yugen Sato, Tomohiro Takagi

TL;DR

This work addresses interpretability in multimodal LLMs by locating knowledge in individual Transformer neurons, focusing on the FFN activations that store cross-modal knowledge. It introduces a two-stage, data-driven pipeline for MiniGPT-4: the first stage uses inpainting-based activation differences $O^{\prime k}_{o} - O^{\prime k}_{i}$ to select a candidate set $C^k$, and the second applies GradCAM scores $g_c$ to filter the candidates down to a final set $N_k$ of knowledge neurons. In image-captioning experiments on MS COCO evaluated with standard metrics, the method suppresses the target knowledge more strongly and retains other knowledge better than baselines, and activation heatmaps and token decoding confirm that the identified neurons align with the target semantics. The approach advances explainability and opens avenues for targeted knowledge editing in multimodal models. Its limitations concern the choice of dataset, model, and thresholds, which future work should address through automatic thresholding and broader model generalization. Representative notation: the layer-$l$ activation $O^l = \sigma(W^l_{in}(a^l + h^{l-1}))$, the activation tensor $O' \in \mathbb{R}^{L \times P \times d_f}$, the gradient-weighted score $g_c = O' \frac{\partial y^{c}}{\partial O'}$, and the final neuron set $N_k = \{(L_l, U_i) \in C^k \mid g_c[l, p, i] > \mathrm{threshold}_g^k\}$.
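The two filtering stages can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $O^{\prime k}_{o}$ and $O^{\prime k}_{i}$ are the activations for the original and inpainted image, that the three axes of $O'$ are layers, token positions, and FFN units, and that a neuron passes a stage when its maximum over positions exceeds the threshold; the summary does not say how positions are aggregated. The function name `two_stage_filter` and the threshold arguments `thr_diff` and `thr_grad` are hypothetical.

```python
import numpy as np


def two_stage_filter(act_orig, act_inpaint, grad, thr_diff, thr_grad):
    """Return (layer, unit) pairs that survive both filtering stages.

    act_orig, act_inpaint: arrays of shape (L, P, d_f) holding FFN
        activations for the original and the inpainted image (assumed roles).
    grad: array of shape (L, P, d_f), the gradient of the target score y^c
        with respect to the activations of the original image.
    thr_diff, thr_grad: hypothetical thresholds for stage 1 and stage 2.
    """
    # Stage 1: activation difference between original and inpainted input.
    diff = act_orig - act_inpaint
    # Assumption: a neuron is a candidate if its largest difference over
    # token positions exceeds the threshold.
    candidate_mask = diff.max(axis=1) > thr_diff  # shape (L, d_f)

    # Stage 2: gradient-weighted activation g_c = O' * dy^c/dO'.
    g_c = act_orig * grad
    # Same assumption: reduce over positions with a max.
    gradient_mask = g_c.max(axis=1) > thr_grad  # shape (L, d_f)

    # Keep only candidates that also pass the gradient filter.
    layers, units = np.nonzero(candidate_mask & gradient_mask)
    return [(int(l), int(u)) for l, u in zip(layers, units)]


# Toy example: 2 layers, 3 positions, 4 FFN units.
act_inpaint = np.zeros((2, 3, 4))
act_orig = np.zeros((2, 3, 4))
act_orig[0, 1, 2] = 5.0   # large difference, positive gradient below
act_orig[1, 0, 3] = 4.0   # large difference, negative gradient below
grad = np.zeros((2, 3, 4))
grad[0, 1, 2] = 1.0
grad[1, 0, 3] = -1.0

neurons = two_stage_filter(act_orig, act_inpaint, grad, thr_diff=1.0, thr_grad=1.0)
print(neurons)
```

In the toy data both marked neurons pass the activation-difference stage, but only the one with a positive gradient-weighted score passes the second stage, which is the role the gradient filter plays in the pipeline described above.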

Abstract

Recent advances in large language models (LLMs) have led to the development of multimodal LLMs (MLLMs) in the fields of natural language processing (NLP) and computer vision. Although these models allow for integrated visual and language understanding, they present challenges such as opaque internal processing and the generation of hallucinations and misinformation. Therefore, there is a need for a method to clarify the location of knowledge in MLLMs. In this study, we propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM. Specifically, we extract knowledge neurons through two stages: activation-difference filtering using inpainting and gradient-based filtering using GradCAM. Experiments on image caption generation with the MS COCO 2017 dataset, using BLEU, ROUGE, and BERTScore for quantitative evaluation and activation heatmaps for qualitative evaluation, showed that our method locates knowledge more accurately than existing methods. This study contributes to the visualization and explainability of knowledge in MLLMs and shows the potential for future knowledge editing and control.

Paper Structure

This paper contains 25 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Original and inpainted images
  • Figure 2: Layer distribution of identified neurons
  • Figure 3: Activation heatmap and decoding neurons
  • Figure 4: Activation heatmap and decoding neurons
  • Figure 5: Activation heatmap and decoding neurons