Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions
Nikolaos Rodis, Christos Sardianos, Panagiotis Radoglou-Grammatikis, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
TL;DR
This paper surveys Multimodal Explainable AI (MXAI), outlining how explanations can span multiple data modalities beyond traditional unimodal XAI. It builds a systematic registry of MXAI prediction tasks and datasets, and introduces a three-dimensional framework that categorizes methods by the number of modalities, the stage at which explanations are generated, and the underlying methodology. The authors review how unimodal XAI techniques extend to multimodal scenarios, present a detailed taxonomy of MXAI approaches (UU/UM/MU/MM) and explanation stages (Intrinsic/Post-hoc/Separate module), and summarize evaluation metrics for textual, visual, and multimodal explanations. They also discuss current challenges (such as formal definitions, attention reliability, and evaluation standards) and propose future directions, including causal explanations, bias mitigation, and broader modality integration, to advance practical, trustworthy MXAI systems.
Abstract
Although Artificial Intelligence (AI) has achieved remarkable results across numerous data analysis tasks, this progress is typically accompanied by a significant lack of transparency and trustworthiness in the developed systems. To address this challenge, the eXplainable AI (XAI) research field has emerged, which aims, among other goals, to produce meaningful explanations of the employed model's reasoning process. The current study systematically analyzes recent advances in the area of Multimodal XAI (MXAI), which comprises methods that involve multiple modalities in the primary prediction and explanation tasks. First, the relevant AI-boosted prediction tasks and the publicly available datasets used for learning and evaluating explanations in multimodal scenarios are described. Subsequently, a systematic and comprehensive analysis of the MXAI methods in the literature is provided, based on the following key criteria: a) the number of modalities involved (in the employed AI module), b) the processing stage at which explanations are generated, and c) the type of methodology adopted (i.e., the actual mechanism and mathematical formalization) for producing explanations. Then, the metrics used to evaluate MXAI methods are analyzed in detail. Finally, the current challenges and future research directions are discussed extensively.
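As a rough illustration of how the three categorization criteria could be recorded in practice, the following minimal Python sketch tags each method along the three axes and counts entries per axis. It is not taken from the survey: the class and field names, the entries `method_a` to `method_c`, and the methodology labels are hypothetical placeholders. Only the category labels (UU/UM/MU/MM and Intrinsic/Post-hoc/Separate module) come from the summary above, and their precise definitions are those given in the survey itself.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ModalityScheme(Enum):
    # Labels as listed in the summary; see the survey for their definitions.
    UU = "UU"
    UM = "UM"
    MU = "MU"
    MM = "MM"


class Stage(Enum):
    # Processing stage at which the explanation is generated.
    INTRINSIC = "Intrinsic"
    POST_HOC = "Post-hoc"
    SEPARATE_MODULE = "Separate module"


@dataclass(frozen=True)
class MethodEntry:
    name: str                # hypothetical identifier, not a real method
    scheme: ModalityScheme   # criterion a): modalities involved
    stage: Stage             # criterion b): explanation stage
    methodology: str         # criterion c): free-text label (assumed)


def count_by(entries, field):
    """Count registry entries by the value of one categorization axis."""
    return Counter(getattr(entry, field) for entry in entries)


# Placeholder registry; every entry below is invented for illustration.
registry = [
    MethodEntry("method_a", ModalityScheme.MM, Stage.POST_HOC, "attention-based"),
    MethodEntry("method_b", ModalityScheme.MU, Stage.INTRINSIC, "attention-based"),
    MethodEntry("method_c", ModalityScheme.MM, Stage.POST_HOC, "perturbation-based"),
]

stage_counts = count_by(registry, "stage")
scheme_counts = count_by(registry, "scheme")
```

Keeping the first two axes as enumerations and the third as free text reflects an assumption that the methodology axis is open-ended, whereas the modality and stage labels form small closed sets.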
