ProtoP-OD: Explainable Object Detection with Prototypical Parts

Pavlos Rath-Manakidis, Frederik Strothmann, Tobias Glasmachers, Laurenz Wiskott

TL;DR

ProtoP-OD adds a prototype neck that converts backbone features into prototypical parts, producing readable prototype maps that illuminate what the detector perceives. By introducing an alignment loss and allowing either Softmax, Sparsemax, or Argmax encoding, the method enforces class-aligned, sparse, and interpretable activations that can be visualized alongside attention. Experiments on COCO show a modest drop in mAP but improved explainability metrics and meaningful qualitative visualizations via multi-prototype and product maps. The work advances explainable object detection by providing causally relevant, human-readable internal representations and a framework for exploring prototype-based explanations in detection transformers.

Abstract

Interpretation and visualization of the behavior of detection transformers tends to highlight the locations in the image that the model attends to, but it provides limited insight into the semantics that the model is focusing on. This paper introduces an extension to detection transformers that constructs prototypical local features and uses them in object detection. These custom features, which we call prototypical parts, are designed to be mutually exclusive and align with the classifications of the model. The proposed extension consists of a bottleneck module, the prototype neck, that computes a discretized representation of prototype activations and a new loss term that matches prototypes to object classes. This setup leads to interpretable representations in the prototype neck, allowing visual inspection of the image content perceived by the model and a better understanding of the model's reliability. We show experimentally that our method incurs only a limited performance penalty, and we provide examples that demonstrate the quality of the explanations provided by our method, which we argue outweighs the performance penalty.
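
The TL;DR names three encodings for the prototype activations: Softmax, Sparsemax, and Argmax. The following is a minimal sketch of how these three differ when applied to the prototype scores at a single image location. It is not the authors' implementation: the function names, the assumption that the encoding is applied per location across prototypes, and the example scores are all illustrative. Sparsemax is written as the standard simplex projection of Martins and Astudillo (2016).

```python
import math


def softmax(scores):
    """Dense encoding: every prototype keeps a nonzero weight."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparse encoding: Euclidean projection onto the probability simplex."""
    running_sum = 0.0
    threshold = 0.0
    for k, s in enumerate(sorted(scores, reverse=True), start=1):
        running_sum += s
        if 1.0 + k * s > running_sum:  # k-th largest score is still in the support
            threshold = (running_sum - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def argmax_one_hot(scores):
    """Fully discrete encoding: only the strongest prototype survives."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


# Made-up scores for three hypothetical prototypes at one image location.
scores = [1.0, 0.8, 0.1]
for encode in (softmax, sparsemax, argmax_one_hot):
    print(encode.__name__, [round(w, 3) for w in encode(scores)])
```

On these scores, Softmax gives all three prototypes a positive weight, Sparsemax zeroes out the weakest one while keeping a graded split between the other two, and Argmax keeps only the strongest. Under the stated assumptions, this is the sense in which the sparser encodings make the prototypes closer to mutually exclusive.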

Paper Structure

This paper contains 36 sections, 6 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Detection transformer model with prototype neck. The prototype neck transforms backbone features into readable prototype maps that are subsequently used for object detection (OD). The figure is adapted from Figure 1 of Carion et al. (DETR). The yellow block marks our modification to the design. See Figure 3 for details.
  • Figure 2: A depiction of the most active prototypes in an image. Each prototype is represented by a separate color. Most prototypes used to represent the children and the space around them are assigned to the corresponding person class. The prototype in blue focuses on the body of persons, orange on the space between persons, green on heads, and purple on ties. Cyan represents areas where other prototypes dominate. This prototype map was obtained with the model large described in the experimental setup section.
  • Figure 3: Structure of the prototype neck. The image representation from the backbone is processed into features that encode the interpretable prototype maps. Rounded rectangles represent intermediate representations and normal rectangles operations.
  • Figure 4: Multi-prototype map with legend. Each image location is colored according to the prototypes most active in it, with cyan representing all prototypes that are not colored individually. The semantics of the displayed prototypes also depend on the scale and view of the objects. Background areas are also assigned to prototypes. The map is from model large.
  • Figure 5: Product map example. One panel shows the image-wide multi-prototype map. Some prototypes relevant to the detection of the person are not shown separately. Instead, they are shown in cyan, which represents all prototypes that are not separately colored. Focusing the map on the areas attended to for the detection leads to the product map in the other panel, which shows all relevant prototypes separately. The maps are from model large.
  • ...and 4 more figures
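
The captions of Figures 4 and 5 describe two kinds of maps: a multi-prototype map that colors each location by its most active prototype, and a product map that focuses this on the areas attended to for one detection. The sketch below illustrates one plausible reading of that idea. It rests on assumptions that are not confirmed by the text above: the product is taken as an elementwise multiplication of prototype activations with an attention map, the separately colored prototypes are chosen by total activation, and all names (`product_map`, `top_prototypes`, `label_map`, `OTHER`) and data are hypothetical.

```python
def product_map(proto_maps, attention):
    """Weight each prototype map (K x H x W) elementwise by an attention map (H x W)."""
    return [
        [[p * a for p, a in zip(p_row, a_row)] for p_row, a_row in zip(p_map, attention)]
        for p_map in proto_maps
    ]


def top_prototypes(maps, n):
    """Indices of the n prototypes with the largest total activation."""
    totals = [sum(sum(row) for row in m) for m in maps]
    return sorted(range(len(maps)), key=lambda k: totals[k], reverse=True)[:n]


OTHER = -1  # stands in for the cyan "all other prototypes" color


def label_map(maps, shown):
    """Per location: index of the strongest prototype if it is in `shown`, else OTHER."""
    height, width = len(maps[0]), len(maps[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(range(len(maps)), key=lambda k: maps[k][y][x])
            row.append(best if best in shown else OTHER)
        labels.append(row)
    return labels


# Toy data: three prototypes on a 1 x 2 grid, attention only on the right cell.
proto_maps = [[[0.9, 0.1]], [[0.1, 0.2]], [[0.0, 0.7]]]
attention = [[0.0, 1.0]]

global_labels = label_map(proto_maps, top_prototypes(proto_maps, 1))
focused = product_map(proto_maps, attention)
product_labels = label_map(focused, top_prototypes(focused, 1))
print(global_labels)   # [[0, -1]]
print(product_labels)  # [[-1, 2]]
```

In this toy case the image-wide map has only one color slot and spends it on prototype 0, so the prototype that matters in the attended cell falls into the "other" bucket. After weighting by attention, prototype 2 becomes the top-ranked one and is shown separately, which mirrors the effect the Figure 5 caption describes.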