A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A. Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, Ying Shen, Barry Menglong Yao, Zhiyang Xu, Qin Liu, Yuxiang Zhang, Yan Sun, Shilong Liu, Li Shen, Hongxuan Li, Soheil Feizi, Lifu Huang

TL;DR

This survey addresses mechanistic interpretability for multimodal foundation models (MMFMs), highlighting gaps relative to unimodal LLM interpretability. It introduces a three-dimensional taxonomy (Model Family, Interpretability Techniques, Applications) and analyzes how LLM-based methods transfer to MMFMs, as well as novel multimodal-specific approaches. Key contributions include synthesizing methods across non-generative VLMs, text-to-image diffusion models, and generative VLMs, mapping insights to downstream tasks such as hallucination mitigation, model editing, safety, and compositionality. The work concludes with open challenges and a roadmap for developing benchmarks and tools to advance robust, interpretable multimodal systems.

Abstract

The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.

Paper Structure

This paper contains 58 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: In our survey, we study two types of mechanistic interpretability methods: (1) methods adapted from LLM interpretability techniques and (2) multimodal-specific interpretability methods. Different analysis methods are applied to three multimodal model architectures: (a) Non-generative Vision-Language Models, (b) Multimodal Large Language Models, and (c) Text-to-Image Generative Models (especially diffusion models). The interpretability insights from different methods and models can illuminate specific applications.
  • Figure 2: Illustrations of interpretability methods: (a) Linear Probing, (b) Logit Lens, and (c) Causal Tracing.
  • Figure 3: Illustrations of interpretability methods: (a) Representation Decomposition, (b) Sparse Autoencoder, and (c) Neuron-level Analysis.
  • Figure 4: Illustration of model components, using a transformer-based generative vision-language model as an example.