Position: Do Not Explain Vision Models Without Context
Paulina Tomaszewska, Przemysław Biecek
TL;DR
Current vision-model explanations rely on region-focused heatmaps that fail when labels depend on spatial relationships. The authors develop a spatial-context framework, survey deep learning architectures and pretraining strategies that capture context, analyze XAI failures on spatial tasks, and propose research directions including benchmarks, measures, and diverse explanation modalities. Key contributions include a taxonomy of semantic, spatial, and scale context; a demonstration of the limitations of popular XAI methods; and a concrete call for spatial XAI benchmarks and probing techniques to assess and improve explanations. The work aims to improve safety and reliability in critical applications (e.g., autonomous driving, healthcare, surveillance) by aligning explanations with the spatial decision factors that models actually use.
Abstract
Does the stethoscope in the picture make the adjacent person a doctor or a patient? This, of course, depends on the contextual relationship between the two objects. If this is obvious, why don't explanation methods for vision models use contextual information? In this paper, we (1) review the most popular methods of explaining computer vision models and point out that they do not take contextual information into account, (2) show examples of failures of popular XAI methods, (3) provide examples of real-world use cases where spatial context plays a significant role, (4) propose new research directions that may lead to better use of contextual information in explaining computer vision models, and (5) argue that explanations need to shift from 'where' to 'how'.
