Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization

Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh

TL;DR

Winsor-CAM is a single-pass gradient-based method that aggregates Grad-CAM maps from all convolutional layers and applies percentile-based Winsorization to attenuate outlier contributions, providing an efficient, robust, and human-tunable explanation tool for expert-in-the-loop analysis.

Abstract

Interpreting Convolutional Neural Networks (CNNs) is critical for safety-sensitive applications such as healthcare and autonomous systems. Popular visual explanation methods like Grad-CAM use a single convolutional layer, potentially missing multi-scale cues and producing unstable saliency maps. We introduce Winsor-CAM, a single-pass gradient-based method that aggregates Grad-CAM maps from all convolutional layers and applies percentile-based Winsorization to attenuate outlier contributions. A user-controllable percentile parameter p enables semantic-level tuning from low-level textures to high-level object patterns. We evaluate Winsor-CAM on six CNN architectures using PASCAL VOC 2012 and PolypGen, comparing localization (IoU, center-of-mass distance) and fidelity (insertion/deletion AUC) against seven baselines: Grad-CAM, Grad-CAM++, LayerCAM, ScoreCAM, AblationCAM, ShapleyCAM, and FullGrad. On DenseNet121 with a subset of PASCAL VOC 2012, Winsor-CAM achieves 46.8% IoU and 0.059 CoM distance versus 39.0% and 0.074 for Grad-CAM, with improved insertion AUC (0.656 vs. 0.623) and deletion AUC (0.197 vs. 0.242). Notably, even the worst-performing fixed value of p outperforms FullGrad across all metrics. An ablation study confirms that incorporating earlier layers improves localization. A similar evaluation on PolypGen polyp segmentation further validates Winsor-CAM's effectiveness in medical imaging contexts. Winsor-CAM provides an efficient, robust, and human-tunable explanation tool for expert-in-the-loop analysis.
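
The percentile-based Winsorization of layer importance described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `winsorize_layer_importance`, the choice to cap only the upper tail at the p-th percentile of the nonzero scores, and the division by the maximum as the normalization step are all assumptions made for illustration.

```python
import numpy as np

def winsorize_layer_importance(raw_scores, p):
    """Cap per-layer importance scores at their p-th percentile and rescale.

    raw_scores: one scalar importance value per convolutional layer.
    p: percentile in [0, 100] chosen by the user.
    Assumption: only the upper tail is capped, and zero-valued layers are
    excluded from the percentile and stay at zero.
    """
    # Keep only positive evidence for the target class (ReLU).
    scores = np.maximum(np.asarray(raw_scores, dtype=float), 0.0)
    nonzero = scores > 0
    if not nonzero.any():
        return np.zeros_like(scores)
    # Percentile computed over nonzero layers only.
    cap = np.percentile(scores[nonzero], p)
    clipped = np.where(nonzero, np.minimum(scores, cap), 0.0)
    # Assumed normalization: scale so the largest weight equals 1.
    return clipped / clipped.max()

raw = [-0.5, 0.1, 0.2, 0.4, 8.0]  # hypothetical scores, early to late layers
low_p = winsorize_layer_importance(raw, 0)
mid_p = winsorize_layer_importance(raw, 50)
high_p = winsorize_layer_importance(raw, 100)
```

In this toy example a low p flattens the weights, so early layers count as much as the dominant last layer, while p = 100 leaves the outlier untouched and lets it dominate.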

Paper Structure

This paper contains 22 sections, 16 equations, 11 figures, 12 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of Winsor-CAM and standard Grad-CAM outputs on a ResNet-50 model, illustrating improved localization and robustness to interpolation artifacts. Winsor-CAM produces smoother, semantically aligned heatmaps under both bilinear and nearest-neighbor upsampling, while Grad-CAM exhibits spatial distortion and noise, particularly under nearest interpolation. ImageNet validation set, class: "bald eagle".
  • Figure 2: Comparison of multi-layer CAM-based methods on an image from ImageNet with the class "Admiral" (butterfly) with VGG16. Winsor-CAM (top left, $p=80$) and FullGrad \cite{srinivas2019full} (top right) outputs were generated in this work. EigenLayer-CAM (bottom left) and HiResRP-CAM (bottom right) visualizations are adapted from \cite{Andrei2025} for comparison, licensed under CC BY 4.0.
  • Figure 3: Step-by-step visualization of the Winsor-CAM pipeline for layer-wise mean importance. From left to right: input image (ImageNet validation set, class: "goldfish"), raw layer-wise importance scores (before ReLU), positive importance scores after ReLU $\Gamma^c_i$ (Eq. \ref{eq:layer_importance_mean}), Winsorized importance on normalized scale (Steps \ref{step4} and \ref{step5}), and final Winsor-CAM heatmap overlay. This shows how layer-wise importance is extracted, outliers suppressed via Winsorization, and importance structure preserved.
  • Figure 4: Comparison of Grad-CAM and Winsor-CAM pipelines. Top: Standard Grad-CAM applied to a single convolutional layer. Gradients w.r.t. target class are computed and average-pooled to obtain filter-wise importance weights, which linearly combine activation maps to produce a heatmap at the layer's spatial resolution. Bottom: Winsor-CAM pipeline. Grad-CAM maps are computed for all convolutional layers, and corresponding importance weights yield layer-wise importance scores. These scores are Winsorized to suppress outliers and normalized (both operations account for zero-valued layers, as in Fig. \ref{fig:Overlay}). Normalized importance scores then weight the interpolated Grad-CAM maps from all layers, producing the final high-resolution heatmap.
  • Figure 5: Progression of Winsor-CAM visualizations on DenseNet121 as percentile $p$ varies from 0 to 100 (increments of 10) for class "goldfish". Top: raw heatmaps showing saliency shift from fine-grained details to broader patterns as $p$ increases. Middle: binarized masks after thresholding. Bottom: layer-wise importance distributions (x-axis = layer index, early to late; y-axis = normalized importance after Winsorization). Lower $p$-values suppress extreme scores and emphasize early-layer features (textures, edges), while higher $p$-values retain broader contributions from deeper layers (coarser, high-level saliency). This demonstrates Winsor-CAM's semantic-level control over explanation granularity.
  • ...and 6 more figures
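
The final fusion step described in the Figure 4 caption, where normalized importance scores weight the interpolated Grad-CAM maps from all layers, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: `upsample_nearest` and `aggregate_cams` are hypothetical helper names, nearest-neighbor interpolation is used only to keep the example dependency-free, and the final division by the peak value is an assumed normalization.

```python
import numpy as np

def upsample_nearest(cam, out_hw):
    """Nearest-neighbor resize of a 2-D map to (H, W)."""
    h, w = cam.shape
    H, W = out_hw
    rows = np.arange(H) * h // H  # source row for each output row
    cols = np.arange(W) * w // W  # source column for each output column
    return cam[np.ix_(rows, cols)]

def aggregate_cams(layer_cams, weights, out_hw):
    """Weighted sum of per-layer maps at a common resolution.

    layer_cams: list of 2-D per-layer Grad-CAM maps, any resolution.
    weights: one normalized importance score per layer.
    """
    fused = np.zeros(out_hw, dtype=float)
    for cam, weight in zip(layer_cams, weights):
        cam = np.maximum(np.asarray(cam, dtype=float), 0.0)  # Grad-CAM ReLU
        fused += weight * upsample_nearest(cam, out_hw)
    peak = fused.max()
    # Assumed final normalization to [0, 1]; skipped for an all-zero map.
    return fused / peak if peak > 0 else fused

# Toy example: a fine early-layer map and a coarse late-layer map.
early = np.ones((4, 4))
late = np.array([[1.0, 0.0], [0.0, 0.0]])
heatmap = aggregate_cams([early, late], weights=[0.5, 1.0], out_hw=(4, 4))
```

The weights here would come from the Winsorized, normalized layer importance scores, so changing $p$ changes how much each layer's map contributes to the fused heatmap.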