
Explaining Object Detectors via Collective Contribution of Pixels

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

TL;DR

This paper explains object detectors by accounting for the collective contribution of pixels rather than treating pixels independently. It introduces VX-CODE, a greedy, patch-based explanation method that combines Shapley values and interactions to capture both individual and joint pixel influences on bounding-box localization and class decisions, supported by the novel pi*-index for sequential coalitional analysis. In experiments on DETR and Faster R-CNN across COCO and VOC, VX-CODE achieves better insertion and deletion AUC scores than state-of-the-art baselines, with notable gains as the patch group size $r$ increases. The paper also applies the method to a biased model, to failure cases, and to object-level foundation models such as Grounding DINO. The result is a practical, theoretically grounded framework for faithful explanations of detectors, intended to support model debugging, bias detection, and interpretability in safety-critical settings.

Abstract

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code will be publicly available soon.
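
To make the game-theoretic vocabulary of the abstract concrete, the following minimal sketch computes exact Shapley values and the pairwise Shapley interaction index for a toy three-player game. The value function `toy_value` and the player set are assumptions made purely for illustration. In the paper's setting the players would be image patches and the value would come from the detector's output, which this listing does not reproduce.

```python
from itertools import combinations
from math import factorial


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    players = list(players)
    for size in range(len(players) + 1):
        for combo in combinations(players, size):
            yield frozenset(combo)


def shapley_value(value, players, i):
    """Exact Shapley value of player i under the set function `value`."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        total += weight * (value(s | {i}) - value(s))
    return total


def shapley_interaction(value, players, i, j):
    """Pairwise Shapley interaction index of players i and j."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        # Joint effect of i and j beyond the sum of their separate effects.
        delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
        total += weight * delta
    return total


def toy_value(s):
    """Hypothetical reward: patch 0 helps alone, patches 1 and 2 only help together."""
    score = 0.0
    if 0 in s:
        score += 0.4
    if 1 in s and 2 in s:
        score += 0.6
    return score


players = [0, 1, 2]
phi = [shapley_value(toy_value, players, i) for i in players]
pair_12 = shapley_interaction(toy_value, players, 1, 2)
pair_01 = shapley_interaction(toy_value, players, 0, 1)
```

In this toy game, patches 1 and 2 each receive a Shapley value of 0.3, while their interaction index is 0.6 and the interaction between patches 0 and 1 is zero. The individual values alone do not show that patches 1 and 2 matter only jointly, which is the kind of collective effect the abstract says individual-contribution methods overlook.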

Paper Structure

This paper contains 51 sections, 37 equations, 21 figures, 11 tables, 3 algorithms.

Figures (21)

  • Figure 1: Generated heat maps by ODAM, SSGrad-CAM++, and VX-CODE. Each heat map shows the insertion AUC (Ins AUC), which measures faithfulness (higher is better). While previous methods highlight dominant features (e.g., hand or surfboard), VX-CODE captures collective contributions across multiple features (e.g., leg+head or surfboard+sea) through interactions.
  • Figure 2: Comparison of visualizations generated by existing methods and VX-CODE with patch insertion ($r=1$) for an aeroplane detected using the biased model.
  • Figure 3: Comparison of visualizations generated by existing methods and VX-CODE with patch insertion ($r=1$) for failure cases. In mislocalization, the green and red boxes indicate the ground truth and the prediction, respectively.
  • Figure 4: Overview of VX-CODE. The input image is divided into patches, and $r$ patches are selected at each step. This selection process considers not only individual contributions but also collective contributions through interaction. This figure illustrates the case of $r=1$ at steps $k=1$ and $k=2$ of patch insertion. Gray-bordered patches represent the previously selected patches $\mathcal{B}_{k-1}$, while yellow-bordered patches represent newly selected patches $B_{k}$. See App. \ref{sec:pseudo_code} for detailed algorithms.
  • Figure 5: The insertion and deletion curves for DETR on (a) MS-COCO and (b) PASCAL VOC.
  • ...and 16 more figures
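
The step-wise patch insertion described in the Figure 4 caption and the insertion AUC reported in Figure 1 can be sketched as follows. This is a simplified illustration with $r=1$, not the authors' algorithm: the selection score here is only the marginal gain of a hypothetical `toy_reward` function, whereas VX-CODE's criterion combines Shapley values and interactions in a way this listing does not reproduce. The function names, the toy reward, and the normalization of the AUC are all assumptions.

```python
def greedy_insertion(reward, num_patches):
    """Insert one patch per step (r = 1), picking the largest marginal gain.

    Simplified stand-in for the paper's selection rule, which is not reproduced.
    Returns the insertion order and the reward after each step (starting empty).
    """
    selected = frozenset()
    order = []
    curve = [reward(selected)]
    remaining = set(range(num_patches))
    while remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=lambda p: reward(selected | {p}))
        selected = selected | {best}
        remaining.remove(best)
        order.append(best)
        curve.append(reward(selected))
    return order, curve


def insertion_auc(curve):
    """Trapezoidal area under the curve, with the x-axis normalized to [0, 1]."""
    steps = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(steps)) / steps


def toy_reward(s):
    """Hypothetical detector score: patch 0 helps alone, patches 1 and 2 only jointly."""
    return (0.4 if 0 in s else 0.0) + (0.6 if {1, 2} <= s else 0.0)


order, curve = greedy_insertion(toy_reward, num_patches=4)
auc = insertion_auc(curve)
```

At the second step of this toy run, patches 1, 2, and 3 all have the same marginal gain, so a purely individual score cannot tell the useful pair apart from the irrelevant patch 3 and the choice falls to tie-breaking. A selection rule that also scores interactions, as the Figure 4 caption describes, is meant to handle this kind of case.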

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8