Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions

Nikolaos Rodis, Christos Sardianos, Panagiotis Radoglou-Grammatikis, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

TL;DR

This paper surveys Multimodal Explainable AI (MXAI), outlining how explanations can span multiple data modalities beyond traditional unimodal XAI. It builds a systematic registry of MXAI prediction tasks and datasets, and introduces a three-dimensional framework that categorizes methods by the number of modalities, the stage at which explanations are generated, and the underlying methodology. The authors review how unimodal XAI techniques extend to multimodal scenarios, present a detailed taxonomy of MXAI approaches (UU/UM/MU/MM) and explanation stages (Intrinsic/Post-hoc/Separate module), and summarize evaluation metrics for textual, visual, and multimodal explanations. They also discuss current challenges—such as formal definitions, attention reliability, and evaluation standards—and propose future directions, including causal explanations, bias mitigation, and broader modality integration to advance practical, trustworthy MXAI systems.

Abstract

Although Artificial Intelligence (AI) has enabled remarkable results across numerous data analysis tasks, this progress is typically accompanied by a significant lack of transparency and trustworthiness in the developed systems. To address this challenge, the eXplainable AI (XAI) research field has emerged, which aims, among other goals, to produce meaningful explanations of a model's reasoning process. The current study systematically analyzes recent advances in Multimodal XAI (MXAI), which comprises methods that involve multiple modalities in the primary prediction and explanation tasks. First, the relevant AI-based prediction tasks and the publicly available datasets used for learning and evaluating explanations in multimodal scenarios are described. Subsequently, a systematic and comprehensive analysis of the MXAI methods in the literature is provided, taking into account the following key criteria: a) the number of modalities involved (in the employed AI module), b) the processing stage at which explanations are generated, and c) the type of methodology adopted (i.e., the actual mechanism and mathematical formalization) for producing explanations. Then, a thorough analysis of the metrics used to evaluate MXAI methods is performed. Finally, the current challenges and future research directions are discussed at length.

Paper Structure (23 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Difference between unimodal and multimodal XAI: a) Unimodal explanation (saliency map) for image classification selvaraju2016grad2, and b) Multimodal explanation (visual and textual) for zero-shot learning Liu2020.
  • Figure 2: Categorization of MXAI approaches: a) Main classes of conventional (unimodal) XAI methods that can also be adopted in MXAI analysis, b) Basic categories that can be considered only for MXAI approaches.
  • Figure 3: MXAI categories with respect to the number of the involved modalities in the primary prediction model input and the generated explanation: a) Unimodal task and unimodal explanation (UU) Hendricks2016, b) Unimodal task and multimodal explanation (UM) Liu2020, c) Multimodal task and unimodal explanation (MU) patro2018differential, and d) Multimodal task and multimodal explanation (MM) Park2018.
  • Figure 4: Conventional XAI methods' generated explanations for multimodal scenarios: a) Grad-CAM++ for image captioning Chattopadhay2018, b) LIFT-CAM for a VQA task jung2021towards, c) RISE for image captioning Petsiuk2018RISERI, and d) Guided-backpropagation for a VQA task Kim2017VisualEHadamard.
  • Figure 5: UU MXAI methods' generated explanations: a) Textual Hendricks2016, b) Textual barratt2017interpnet, c) Object relationships lu2016visual, and d) Scene graph zhuo2019explainable.
  • ...and 3 more figures
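
The UU/UM/MU/MM categories in Figure 3 and the explanation stages named in the TL;DR can be read as a simple labeling rule. The following is a minimal sketch of that rule, not code from the paper: the names `Stage`, `MXAIMethod`, and `category`, and the two example entries, are hypothetical and introduced only for illustration. It assumes a method counts as "multimodal" on either side whenever more than one modality is involved.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Stage at which the explanation is generated (labels from the TL;DR)."""
    INTRINSIC = "Intrinsic"
    POST_HOC = "Post-hoc"
    SEPARATE_MODULE = "Separate module"


@dataclass(frozen=True)
class MXAIMethod:
    """Hypothetical record describing one explanation method."""
    name: str
    input_modalities: frozenset        # modalities fed to the prediction model
    explanation_modalities: frozenset  # modalities of the generated explanation
    stage: Stage

    def __post_init__(self):
        # Both sides need at least one modality for the label to make sense.
        if not self.input_modalities or not self.explanation_modalities:
            raise ValueError("each side needs at least one modality")

    def category(self) -> str:
        """Return 'UU', 'UM', 'MU' or 'MM' (task letter first, explanation second)."""
        task = "U" if len(self.input_modalities) == 1 else "M"
        explanation = "U" if len(self.explanation_modalities) == 1 else "M"
        return task + explanation


# Illustrative entries only; they do not correspond to specific cited works.
saliency = MXAIMethod(
    name="saliency map for an image classifier",
    input_modalities=frozenset({"image"}),
    explanation_modalities=frozenset({"image"}),
    stage=Stage.POST_HOC,
)
vqa = MXAIMethod(
    name="VQA model with visual and textual justification",
    input_modalities=frozenset({"image", "text"}),
    explanation_modalities=frozenset({"image", "text"}),
    stage=Stage.INTRINSIC,
)

print(saliency.category(), vqa.category())
```

Under these assumptions, the third criterion (the type of methodology) would be a further field on the record; it is left out here because the text above does not enumerate its values.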