Position: Do Not Explain Vision Models Without Context

Paulina Tomaszewska, Przemysław Biecek

TL;DR

Current vision-model explanations rely on region-focused heatmaps that fail when labels depend on spatial relationships. The authors develop a spatial-context framework, survey DL architectures and pretraining strategies that capture context, analyze XAI failures on spatial tasks, and propose research directions including benchmarks, measures, and diverse explanation modalities. Key contributions include a taxonomy of semantic, spatial, and scale context; demonstration of the limitations of popular XAI methods; and a concrete call for spatial XAI benchmarks and probing techniques to assess and improve explanations. The work aims to improve safety and reliability in critical applications (e.g., autonomous driving, healthcare, surveillance) by aligning explanations with the actual spatial decision factors used by models.

Abstract

Does the stethoscope in the picture make the adjacent person a doctor or a patient? This, of course, depends on the contextual relationship of the two objects. If it's obvious, why don't explanation methods for vision models use contextual information? In this paper, we (1) review the most popular methods of explaining computer vision models by pointing out that they do not take into account context information, (2) show examples of failures of popular XAI methods, (3) provide examples of real-world use cases where spatial context plays a significant role, (4) propose new research directions that may lead to better use of context information in explaining computer vision models, (5) argue that a change in approach to explanations is needed from 'where' to 'how'.

Paper Structure (38 sections, 6 figures)

Figures (6)

  • Figure 1: Example of two images containing the same objects arranged differently within the scene (angular orientation). A DL model would correctly classify the images into separate classes. However, common heatmap-based explanations would highlight only the same two triangles in both cases and would not capture the spatial relationships that are crucial for correct classification. Even after translating the highlighted heatmap regions into semantic concepts, as suggested by crp (from where to what), we would simply learn that both images contain two triangles. The explanations would therefore be identical, although they should indicate that spatial relationships are the model's main decision factor. That is why we postulate a paradigm shift from where to how, so that XAI methods capture how the objects are oriented towards each other.
  • Figure 2: Taxonomy of contextual information within images. It is an extended version of the one proposed by context_survey.
  • Figure 3: Examples of images where the ground truth labels depend on spatial relationships between objects: distance (a, e), inside/outside (b, f), order (c, g), orientation (d, h). The images were created with the assistance of DALL-E 3.
  • Figure 4: Visual IQ test as an example of a task that requires understanding spatial relationships: it is solved by choosing one answer from the given set (b) to complete the input RPM (a).
  • Figure 5: Both images contain the same elements, so a CNN will most probably classify both as a face, unlike a CapsuleNet, which will not classify the deformed face as a face because the spatial relationships between its elements are not preserved.
  • ...and 1 more figures
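
The argument in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' method or any existing XAI library. It assumes two hand-drawn object masks per scene (squares stand in for the triangles), and the helper names `object_mask`, `what_summary`, and `how_descriptor` are hypothetical. A position-free summary of what is highlighted comes out identical for both scenes, while a simple pairwise descriptor (the angle between the object centroids) tells them apart.

```python
import numpy as np


def object_mask(center, half=2, shape=(32, 32)):
    """Boolean mask of one square object (a stand-in for a triangle in Figure 1)."""
    mask = np.zeros(shape, dtype=bool)
    r, c = center
    mask[r - half:r + half + 1, c - half:c + half + 1] = True
    return mask


def what_summary(masks):
    """'Where/what' view: sorted object sizes, with all position information discarded."""
    return sorted(int(m.sum()) for m in masks)


def centroid(mask):
    """Mean (row, column) position of the pixels in a mask."""
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()


def how_descriptor(masks):
    """'How' view: angle in degrees of the vector from the first centroid to the second."""
    (r0, c0), (r1, c1) = centroid(masks[0]), centroid(masks[1])
    return float(np.degrees(np.arctan2(r1 - r0, c1 - c0)))


# Two scenes with the same two objects, arranged differently (assumed toy layout).
scene_a = [object_mask((16, 8)), object_mask((16, 24))]  # side by side
scene_b = [object_mask((8, 16)), object_mask((24, 16))]  # one above the other

print(what_summary(scene_a) == what_summary(scene_b))  # same objects in both scenes
print(how_descriptor(scene_a), how_descriptor(scene_b))  # relative orientation differs
```

The angle is only one possible relational descriptor. Distance, containment, or order (the relations listed for Figure 3) could be computed from the same masks in a similar way.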