Revisiting FunnyBirds evaluation framework for prototypical parts networks

Szymon Opłatek; Dawid Rymarczyk; Bartosz Zieliński

TL;DR

This study comprehensively compares the metric scores obtained for two types of ProtoPNet visualizations, bounding boxes and similarity maps. It indicates that similarity maps align better with the essence of ProtoPNet, as evidenced by the different metric scores obtained from FunnyBirds.

Abstract

Prototypical parts networks, such as ProtoPNet, became popular due to their potential to produce more genuine explanations than post-hoc methods. However, for a long time, this potential was strictly theoretical, and no systematic studies existed to support it. That changed recently with the introduction of the FunnyBirds benchmark, which includes metrics for evaluating different aspects of explanations. However, this benchmark visualizes explanations as attribution maps for all explanation techniques except ProtoPNet, for which bounding boxes are used. This choice significantly influences the metric scores and calls into question the conclusions stated in the FunnyBirds publication. In this study, we comprehensively compare the metric scores obtained for two types of ProtoPNet visualizations: bounding boxes and similarity maps. Our analysis indicates that similarity maps align better with the essence of ProtoPNet, as evidenced by the different metric scores obtained from FunnyBirds. Therefore, we advocate using similarity maps as the visualization technique for prototypical parts networks in explainability evaluation benchmarks.

Paper Structure (17 sections, 1 equation, 5 figures, 4 tables)

Introduction
Related works
Evaluation of xAI.
Methods
FunnyBirds
Dataset.
Interface functions.
Default interface functions for prototypical parts.
Metrics.
Summed Similarity Maps (SSM) for more precise interface functions
Experimental setup
Results
Metrics scores for attribution maps based on bounding boxes or similarity maps
Various backbones of ProtoPNet
Conclusions
...and 2 more sections

Figures (5)

Figure 1: Attribution Maps (AM) based on bounding boxes and similarity maps for two prototypical parts of ProtoPNet trained on the FunnyBirds dataset. For prototypical part 2, both AM types correctly cover the tail prototype. However, for prototypical part 1, the AM based on bounding boxes incorrectly covers almost the whole area of the bird. Such a discrepancy results in incorrect values of the interface function PI (e.g. $333.04$ instead of $0$ for eyes) and inaccurate values of FunnyBirds metrics (see Section \ref{sec:method}).
Figure 2: Calculating Summed Similarity Maps (SSM) and interface functions $PI(\cdot)$ and $P(\cdot)$. The process starts with generating the SSM by summing the similarity maps obtained for prototypical parts. Then, for each bird part, we multiply its mask with the SSM and sum the result to obtain the part importance (e.g. the part importance for the beak equals 488). To obtain the important parts $P(\cdot)$, we check which parts have importance higher than the considered threshold $t$ (e.g., eyes are not in $P(\cdot)$ because their importance of 74 is smaller than the threshold). For this example, $PI=\{beak: 488, eyes: 74, legs: 371, \dots\}$ and $P=\{beak, legs, wings\}$.
Figure 3: Two sample images (top part), their attribution maps generated based on bounding boxes (BB) or similarity maps (SSM), and corresponding SD scores. As defined in Eq. \ref{eq:sd}, SD is computed as the correlation between the orders of $\text{PI}(e)$ (GT) and $f(x) - \{f(x_{\setminus p})\}_{p}$ (BB or SSM). We observe that a more precise SSM attribution map demonstrates that ProtoPNet is much more correct than reported in \cite{hesse2023funnybirds}.
Figure 4: Two sample images (top part), their attribution maps generated based on bounding boxes (BB) or similarity maps (SSM), and important parts ($P$) obtained for various values of $t$. We observe that the original BB approach tends to overidentify parts as important, which results in an incorrectly high completeness score. In contrast, our SSM alternative generates more reliable $P$. Notice that the GT row corresponds to the sets of truly important parts, and the completeness is high if $P$ is similar to one of those sets.
Figure 5: Sample image (a) and SSMs obtained for ProtoPNet with various backbones (b-d).
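
The computation described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the array shapes, the use of binary part masks, and the treatment of $t$ as an absolute threshold are all assumptions made for the example, and the toy numbers are invented rather than taken from the paper.

```python
import numpy as np


def summed_similarity_map(similarity_maps):
    """Sum per-prototype similarity maps of shape (K, H, W) into one SSM (H, W)."""
    return np.asarray(similarity_maps, dtype=float).sum(axis=0)


def part_importance(ssm, part_masks):
    """PI: for each part, multiply its mask with the SSM and sum the result."""
    return {name: float((ssm * mask).sum()) for name, mask in part_masks.items()}


def important_parts(pi, t):
    """P: parts whose importance is higher than the threshold t (assumed absolute)."""
    return {name for name, value in pi.items() if value > t}


# Toy data (hypothetical): two 2x2 similarity maps and three binary part masks.
similarity_maps = [
    [[1, 0], [0, 2]],
    [[3, 0], [0, 1]],
]
part_masks = {
    "beak": np.array([[1, 0], [0, 0]]),
    "eyes": np.array([[0, 1], [0, 0]]),
    "legs": np.array([[0, 0], [0, 1]]),
}

ssm = summed_similarity_map(similarity_maps)
pi = part_importance(ssm, part_masks)
p = important_parts(pi, t=1.0)
print(pi)
print(sorted(p))
```

In this toy case the eyes receive zero importance because no similarity mass falls inside their mask, which mirrors the behaviour the caption contrasts with bounding boxes, where a large box can assign importance to parts the prototype never attended to.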
