Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study

Rabih Chamas, Ismail Khalfaoui-Hassani, Timothee Masquelier

TL;DR

The paper tackles the interpretability gap in visual models by integrating Dilated Convolution with Learnable Spacings (DCLS), which enlarges receptive fields without adding parameters. It evaluates interpretability through Spearman correlations between model heatmaps (Grad-CAM and the proposed Threshold-Grad-CAM) and human attention heatmaps from the ClickMe dataset across eight architectures with drop-in DCLS replacements. The study finds that DCLS generally improves interpretability, with Threshold-Grad-CAM significantly boosting explanations for architectures where Grad-CAM produced unreliable heatmaps; FastViT variants show mixed results. The work provides code and checkpoints, arguing that human-aligned visual strategies can be enhanced via learnable spacings, potentially improving trust and robustness in CV systems.

Abstract

Dilated Convolution with Learnable Spacings (DCLS) is a recent convolution method that, like dilated convolution, enlarges the receptive field (RF) without increasing the number of parameters, but without imposing a regular grid. DCLS has been shown to outperform standard and dilated convolutions on several computer vision benchmarks. Here, we show that DCLS also increases the models' interpretability, defined as alignment with human visual strategies. To quantify it, we use the Spearman correlation between the models' Grad-CAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models - ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36) - and drop-in replaced the standard convolution layers with DCLS ones. This improved the interpretability score in seven of them. Moreover, we observed that Grad-CAM generated random heatmaps for two models in our study, CAFormer and ConvFormer, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models. The code and checkpoints to reproduce this study are available at: https://github.com/rabihchamas/DCLS-GradCAM-Eval.
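
The interpretability score described above is a Spearman rank correlation between a model heatmap and a human heatmap. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the function names (`average_ranks`, `spearman`, `heatmap_alignment`) are hypothetical, and it assumes both heatmaps have already been brought to the same resolution. How the paper resizes heatmaps or aggregates scores over images is not stated in this summary.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # a constant heatmap has no defined rank correlation
    return cov / sqrt(var_a * var_b)


def heatmap_alignment(model_heatmap, human_heatmap):
    """Flatten two same-sized 2D heatmaps (lists of rows) and correlate them."""
    flat_model = [v for row in model_heatmap for v in row]
    flat_human = [v for row in human_heatmap for v in row]
    if len(flat_model) != len(flat_human):
        raise ValueError("heatmaps must have the same number of pixels")
    return spearman(flat_model, flat_human)
```

Because the measure is rank-based, only the ordering of pixel intensities matters: a model heatmap that ranks pixels in the same order as the human heatmap scores 1 regardless of its absolute scale, and a fully reversed ordering scores -1.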

Paper Structure (17 sections, 6 figures, 1 table, 2 algorithms)

Figures (6)

  • Figure 1: Visualization of heatmaps on ClickMe dataset images. First row: original images from the ClickMe dataset. Second row: the same images superimposed with heatmaps created by humans from the ClickMe project. Third row: Threshold-Grad-CAM heatmaps of the ConvNeXt base model enhanced with DCLS. Fourth row: Threshold-Grad-CAM heatmaps of the baseline ConvNeXt base model without DCLS.
  • Figure 2: Comparison of the models' interpretability scores using Threshold-Grad-CAM with and without DCLS. Each point represents a different model, plotted according to its interpretability score without DCLS on the x-axis and with DCLS on the y-axis. Models above the dashed line have a higher interpretability score with DCLS than without it.
  • Figure 3: Comparative analysis of interpretability scores across different models using Grad-CAM and Threshold-Grad-CAM techniques. Top: The interpretability scores with Grad-CAM. Bottom: The interpretability scores with Threshold-Grad-CAM. Both subfigures highlight the difference in scores with and without DCLS. The results indicate that DCLS generally improves interpretability scores for most models.
  • Figure 4: Correlation between model size and interpretability for baseline models, using Threshold-Grad-CAM scores. Larger models tend to have higher interpretability scores, suggesting a positive correlation between model size and explainability in baseline models.
  • Figure 5: ResNet50 Grad-CAM heatmaps and Threshold-Grad-CAM heatmaps across 10 randomly chosen license-free internet images. Top row: Original images. Middle row: Images with Grad-CAM heatmaps. Bottom row: Images with Threshold-Grad-CAM heatmaps.
  • ...and 1 more figures