Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

Yunkai Dang, Kaichen Huang, Jiahao Huo, Yibo Yan, Sirui Huang, Dongrui Liu, Mengxi Gao, Jie Zhang, Chen Qian, Kun Wang, Yong Liu, Jing Shao, Hui Xiong, Xuming Hu

TL;DR

Multimodal large language models (MLLMs) enable cross-modal reasoning but raise transparency concerns. This survey introduces a three-perspective framework—Data, Model, and Training & Inference—to organize explainability methods, then systematically reviews token-, embedding-, neuron-, layer-, and architecture-level approaches. It contrasts data preprocessing, representation learning, and training/inference strategies, highlighting strengths, limitations, and key open challenges, and outlines concrete directions for future work. The work provides a foundational resource to guide the development of more transparent, robust, and trustworthy multimodal AI systems across diverse domains.

Abstract

The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.

Paper Structure

This paper contains 36 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The conceptual framework of this survey. MLLMs handle inputs and outputs that span multiple modalities, such as images, text, video, and audio. We explore interpretability and explainability along three major dimensions: the data, the model, and the training & inference.
  • Figure 2: We classify MLLM explainability into three main categories: Data, Model, and Training & Inference. This structure facilitates a comprehensive overview of the various techniques used to explain MLLMs, along with a discussion of the methods for evaluating these explanations across different paradigms.
  • Figure 3: Overview of our framework. The framework illustrates how input modalities like images, videos, or audio are tokenized into visual or textual tokens and then transformed into embeddings. The architecture includes individual neurons and neuron groups across layers, analyzed through architecture analysis and design. The workflow concludes with training and inference phases.
  • Figure 4: Illustration of three key methodologies for embedding interpretability. Probing-based Interpretation: Evaluates representation quality by training a probing model; its performance reflects the utility of the representations for specific tasks. Attribution-based Interpretation: Assesses input contributions to model outputs using metrics like attention scores or gradients. Decomposition-based Interpretation: Analyzes representations by breaking them into meaningful features, often using sparse auto-encoders or similar tools.
  • Figure 5: Architecture Analysis. We classify architecture analysis methods into three types: uni-modal, multi-modal, and interactive explanations, based on explanation modalities and control signal acceptance.
  • ...and 1 more figure
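
To make the probing-based interpretation named in the Figure 4 caption concrete, here is a minimal sketch, not a method taken from the survey. It assumes synthetic vectors stand in for frozen model embeddings, and it uses a closed-form ridge-regression probe as one simple choice of probing model. All function names (`make_synthetic_embeddings`, `fit_linear_probe`, `probe_accuracy`, `run_probe`) and all parameters are hypothetical. The shuffled-label control is a common sanity check added here for illustration.

```python
import numpy as np


def make_synthetic_embeddings(n=400, dim=16, seed=0):
    """Hypothetical stand-in for frozen embeddings: a binary label is
    linearly encoded along one random unit direction, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    # Shift each sample by +3 or -3 along the label direction.
    embeddings = noise + 3.0 * (2 * labels[:, None] - 1) * direction
    return embeddings, labels


def fit_linear_probe(x, y, ridge=1e-2):
    """Closed-form ridge regression onto +/-1 targets (bias term appended)."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    targets = 2.0 * y - 1.0
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ targets)


def probe_accuracy(weights, x, y):
    """Fraction of samples whose sign prediction matches the label."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    preds = (xb @ weights > 0).astype(int)
    return float((preds == y).mean())


def run_probe(seed=0):
    """Train the probe on one half, evaluate on the other half.
    Also runs a shuffled-label control, which should stay near chance."""
    x, y = make_synthetic_embeddings(seed=seed)
    split = len(y) // 2
    weights = fit_linear_probe(x[:split], y[:split])
    acc = probe_accuracy(weights, x[split:], y[split:])

    rng = np.random.default_rng(seed + 1)
    y_shuffled = rng.permutation(y)
    control_weights = fit_linear_probe(x[:split], y_shuffled[:split])
    control_acc = probe_accuracy(control_weights, x[split:], y_shuffled[split:])
    return acc, control_acc


if __name__ == "__main__":
    acc, control_acc = run_probe()
    print(f"probe accuracy: {acc:.3f}, shuffled-label control: {control_acc:.3f}")
```

High held-out probe accuracy alongside a near-chance control suggests the property is linearly recoverable from the representations, which is how the caption describes probe performance as reflecting their utility for a task. With real models, the synthetic arrays would be replaced by embeddings extracted from a chosen layer.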