A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A. Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, Ying Shen, Barry Menglong Yao, Zhiyang Xu, Qin Liu, Yuxiang Zhang, Yan Sun, Shilong Liu, Li Shen, Hongxuan Li, Soheil Feizi, Lifu Huang
TL;DR
This survey addresses mechanistic interpretability for multimodal foundation models (MMFMs), highlighting gaps relative to unimodal LLM interpretability. It introduces a three-dimensional taxonomy (Model Family, Interpretability Techniques, Applications) and analyzes how LLM-based methods transfer to MMFMs, as well as novel multimodal-specific approaches. Key contributions include synthesizing methods across non-generative vision-language models (VLMs), text-to-image diffusion models, and generative VLMs, and mapping the resulting insights to downstream tasks such as hallucination mitigation, model editing, safety, and compositionality. The work concludes with open challenges and a roadmap for developing benchmarks and tools to advance robust, interpretable multimodal systems.
Abstract
The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and to develop more efficient, reliable, and controllable applications. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs), such as contrastive vision-language models, generative vision-language models, and text-to-image models, pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) the mechanistic differences between unimodal language models and cross-modal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.
