Top-GAP: Integrating Size Priors in CNNs for more Interpretability, Robustness, and Bias Mitigation

Lars Nieradzik, Henrike Stephani, Janis Keuper

TL;DR

Top-GAP is a novel regularization technique that enhances the explainability and robustness of convolutional neural networks by constraining the spatial size of the learned feature representation, which directs more attention towards object pixels rather than the background.

Abstract

This paper introduces Top-GAP, a novel regularization technique that enhances the explainability and robustness of convolutional neural networks. By constraining the spatial size of the learned feature representation, our method forces the network to focus on the most salient image regions, effectively reducing background influence. Using adversarial attacks and the Effective Receptive Field, we show that Top-GAP directs more attention towards object pixels rather than the background. This leads to enhanced interpretability and robustness. We achieve over 50% robust accuracy on CIFAR-10 with PGD $\epsilon=\frac{8}{255}$ and $20$ iterations while maintaining the original clean accuracy. Furthermore, we see increases of up to 5% accuracy against distribution shifts. Our approach also yields more precise object localization, as evidenced by up to 25% improvement in Intersection over Union (IOU) compared to methods like GradCAM and Recipro-CAM.
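
The robustness setting quoted in the abstract can be written out for reference. The following is the commonly used form of the PGD update, not notation taken from the paper: the step size $\alpha$, the loss $\mathcal{L}$, the classifier $f$, the start point $x^{(0)} = x$, and the choice of an $\ell_\infty$ ball are all assumptions, since the abstract only specifies $\epsilon = \frac{8}{255}$ and $T = 20$ iterations.

```latex
x^{(0)} = x, \qquad
x^{(t+1)} = \Pi_{\{x' \,:\, \|x' - x\|_\infty \le \epsilon\}}
\left( x^{(t)} + \alpha \,\operatorname{sign}\!\left( \nabla_{x^{(t)}} \mathcal{L}\big(f(x^{(t)}), y\big) \right) \right),
\qquad t = 0, \dots, T-1
```

Here $\Pi$ projects the perturbed image back onto the set of inputs within distance $\epsilon$ of the clean image $x$, and robust accuracy is the accuracy of $f$ on the final iterates $x^{(T)}$.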

Paper Structure (19 sections, 4 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Example images from a biological classification dataset (a) and ImageNet (b), where we limit the locations in the output feature map that the CNN can use to make predictions. Increasing the allowed pixel count leads to more pixels being highlighted in the class activation map (CAM). If the object size is not known or variable, the pixel constraint with the highest accuracy can be selected.
  • Figure 2: Example of our architecture applied to a backbone with 3 feature maps (e.g. $7\times 7$, $14\times 14$, $28\times 28$). For all convolutions except the final one, a kernel size of $3$ and $256$ filters are used. The last convolution employs a kernel size of $1$, with the number of filters set to match the number of output classes. The CAM is as large as the biggest feature map (here F3). Our pooling layer ("Top-GAP") averages the CAMs given by the last convolutional layer ("Conv") to create a vector containing the probability for each class. To obtain the CAM, we disable "Top-GAP" and perform min-max scaling.
  • Figure 3: ERF for various locations in the output feature map. The background becomes less important using our approach. The last feature map of a standard ResNet has size $7\times 7$; with our approach it has size $56\times 56$.
  • Figure 4: Each line in the graph represents a dataset+architecture combination. The x-axis shows the normalized $k$ value (e.g. $\frac{64}{56^2}$) for the constraint, while the y-axis represents the $\ell_1$ norm.
  • Figure 5: Impact of the pixel constraint on the CAM (wood identification dataset [nieradzik2023automating]). "No constraint" denotes a standard unmodified EfficientNet-B0 model using CAM/GradCAM [Selvaraju_2019]. The object in the center, known as a vessel, should be highlighted. Without our method, the background containing fibers is also highlighted.
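
The captions above mention three small computations: pooling a CAM under a pixel constraint $k$, min-max scaling a CAM for display, and normalizing $k$ by the feature-map area. The following is a minimal sketch of these, assuming that Top-GAP averages only the $k$ largest activations of each class map; the captions do not state the exact rule, so this is an illustrative reading rather than the authors' implementation. The function names `top_gap`, `min_max_scale`, and `normalized_k` are hypothetical.

```python
def top_gap(cam, k):
    """Average the k largest values of one class activation map.

    Assumption: the pixel constraint keeps the k strongest locations.
    `cam` is a 2D list of floats (one map for one class).
    """
    values = sorted((v for row in cam for v in row), reverse=True)
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of CAM locations")
    return sum(values[:k]) / k


def min_max_scale(cam):
    """Rescale a CAM to the range [0, 1] for visualization."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no contrast; return all zeros.
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def normalized_k(k, height, width):
    """Express the pixel constraint as a fraction of the feature-map area."""
    return k / (height * width)


# Toy 3x3 map with one strong central activation.
cam = [
    [0.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 0.0],
]

print(top_gap(cam, 1))           # only the strongest location is used
print(top_gap(cam, 9))           # k equal to the map size is ordinary GAP
print(min_max_scale(cam)[1][1])  # the peak maps to the top of the range
print(normalized_k(64, 56, 56))  # the example value from Figure 4
```

Under this reading, setting $k$ to the full number of locations recovers ordinary global average pooling, while smaller $k$ restricts the prediction to fewer locations, which matches the behaviour described for Figure 1.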

Theorems & Definitions (7)

  • Definition: Effective receptive field
  • Definition: Global Average Pooling
  • Definition: Class Activation Map
  • Definition: GradCAM
  • Definition: Top-GAP
  • Definition: ERF distance
  • Definition: Attack distance