A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future

Shilin Sun, Wenbin An, Feng Tian, Fang Nan, Qidong Liu, Jun Liu, Nazaraf Shah, Ping Chen

TL;DR

This paper provides a historical survey of Multimodal Explainable AI (MXAI) across four eras—traditional machine learning, deep learning, discriminative foundation models, and generative large language models—framing methods through data, model, and post-hoc explainability. It synthesizes how MXAI has evolved from feature-based, interpretable pipelines to Transformer- and LLM-driven systems, detailing datasets, evaluation metrics, and representative techniques for each era. The work highlights key contributions, including a unified taxonomy, a critical comparison with prior surveys, and guidance on future challenges such as hallucination, visual understanding, cognitive alignment, and ground-truth-free evaluation. The findings offer a structured lens to build transparent, fair, and trustworthy multimodal AI systems, with practical implications for researchers, developers, and policy makers.

Abstract

Artificial intelligence (AI) has developed rapidly through advances in computational power and the growth of massive datasets. However, this progress has also heightened the challenge of interpreting the "black-box" nature of AI models. To address these concerns, eXplainable AI (XAI) has emerged with a focus on transparency and interpretability to enhance human understanding of, and trust in, AI decision-making processes. For multimodal data fusion and complex reasoning scenarios, Multimodal eXplainable AI (MXAI) has been proposed to integrate multiple modalities for prediction and explanation tasks. Meanwhile, the advent of Large Language Models (LLMs) has led to remarkable breakthroughs in natural language processing, yet their complexity has further exacerbated the challenges facing MXAI. To gain key insights into the development of MXAI methods and provide guidance for building more transparent, fair, and trustworthy AI systems, we review MXAI methods from a historical perspective and categorize them across four eras: traditional machine learning, deep learning, discriminative foundation models, and generative LLMs. We also review the evaluation metrics and datasets used in MXAI research, and conclude with a discussion of future challenges and directions. A project related to this review has been created at https://github.com/ShilinSun/mxai_review.

Paper Structure

This paper contains 51 sections, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Illustrative diagram of Multimodal Explainable Artificial Intelligence.
  • Figure 2: As AI advances, increasing computational power has led to larger model parameter counts and a growing number of multimodal models.