A Survey on Interpretability in Visual Recognition

Qiyang Wan, Chengzhi Gao, Ruiping Wang, Xilin Chen

TL;DR

This paper provides a systematic survey of XAI in visual recognition by establishing a multi-dimensional taxonomy from a human-centered perspective based on intent, object, presentation, and methodology, and explores the interpretability of Multimodal Large Language Models and practical applications.

Abstract

Visual recognition models have achieved unprecedented success in various tasks. While researchers aim to understand the underlying mechanisms of these models, the growing demand for deployment in safety-critical areas like autonomous driving and medical diagnostics has accelerated the development of eXplainable AI (XAI). Distinct from generic XAI, visual recognition XAI is positioned at the intersection of vision and language, which represent the two most fundamental human modalities and form the cornerstones of multimodal intelligence. This paper provides a systematic survey of XAI in visual recognition by establishing a multi-dimensional taxonomy from a human-centered perspective based on intent, object, presentation, and methodology. Beyond categorization, we summarize critical evaluation desiderata and metrics, conducting an extensive qualitative assessment across different categories and demonstrating quantitative benchmarks within specific dimensions. Furthermore, we explore the interpretability of Multimodal Large Language Models and practical applications, identifying emerging trends and opportunities. By synthesizing these diverse perspectives, this survey provides an insightful roadmap to inspire future research on the interpretability of visual recognition models.

Paper Structure

This paper contains 42 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of XAI in visual recognition. While black-box models provide only raw predictions, interpretability research offers diverse explanations to enhance human trust. The taxonomy proposed in this survey groups existing methods along four dimensions: intent, object, presentation, and methodology.
  • Figure 2: Structural organization of the survey. The primary contribution of this paper is a multi-dimensional taxonomy that describes the interpretability of visual recognition across intent, object, presentation, and methodology. As detailed in Sec. \ref{sec:taxonomy}, this framework categorizes existing methods from a human-centered perspective to support systematic understanding. Beyond this taxonomy, the survey covers evaluation metrics, multimodal large language models, and practical applications to provide a research roadmap.
  • Figure 3: The proposed taxonomy and corresponding method groups of XAI in visual recognition.
  • Figure 4: Illustration of Object. XAI methods can be categorized as local or global, depending on whether the explanation module receives a single sample or the entire model as input. Specifically, in the context of visual recognition, it is also important to consider the model’s representations of categories, concepts, and other high-level semantic labels, which may be viewed as semi-local explanations.
  • Figure 5: Illustration of Presentation. A representative example is shown for each presentation type: scalar \cite{kim2018interpretability}, attention \cite{selvaraju2017grad}, structured representation \cite{nauta2021neural}, semantic unit \cite{yeh2020completeness}, and exemplar \cite{chen2019looks}.
  • ...and 3 more figures