Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps

Yoojin Oh, Junhyug Noh

TL;DR

This work identifies two distortions that softmax introduces into CAM explanations, additive logit shifts and sign collapse, both of which can mislead localization. It proposes a simple, architecture-agnostic fix: a dual-branch sigmoid head that clones the classifier head, fine-tunes the clone as a per-class sigmoid branch with binary supervision, and keeps the original softmax head frozen. At inference, the softmax head supplies the class prediction, so recognition accuracy is preserved, while heatmaps are generated from the sigmoid branch, whose importance scores retain magnitude and sign and are clamped to positive evidence when the map is formed. The head is compatible with existing CAM variants and WSOL pipelines. Experiments on fine-grained datasets (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages-30K) show consistent improvements in explanation fidelity and localization without accuracy loss, with only modest training and inference overhead.

Abstract

Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, the softmax head retains recognition accuracy, while class evidence maps are generated from the sigmoid branch, preserving both the magnitude and the sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages-30K) show improved explanation fidelity and consistent Top-1 Localization gains, without any drop in classification accuracy. Code is available at https://github.com/finallyupper/beyond-softmax.
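The clone, freeze, and fine-tune steps described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the class name `DualBranchHead`, the method `bce_step`, the toy weights, and the learning rate are all hypothetical, the head is assumed to be a single linear layer, and the class balancing mentioned in the abstract is omitted (plain one-vs-rest binary cross-entropy is used instead).

```python
import copy
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, features):
    """Logits z_k = sum_i weights[k][i] * features[i] + bias[k]."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


class DualBranchHead:
    """Hypothetical sketch: a frozen softmax head plus a trainable sigmoid clone."""

    def __init__(self, weights, bias):
        # Softmax branch: kept exactly as pretrained, never updated below.
        self.softmax_w = copy.deepcopy(weights)
        self.softmax_b = list(bias)
        # Sigmoid branch: starts as an exact clone and is then fine-tuned.
        self.sigmoid_w = copy.deepcopy(weights)
        self.sigmoid_b = list(bias)

    def predict(self, features):
        """Class prediction comes from the frozen softmax branch only."""
        probs = softmax(linear(self.softmax_w, self.softmax_b, features))
        return max(range(len(probs)), key=probs.__getitem__)

    def sigmoid_scores(self, features):
        logits = linear(self.sigmoid_w, self.sigmoid_b, features)
        return [sigmoid(z) for z in logits]

    def bce_step(self, features, label, lr=0.1):
        """One SGD step of one-vs-rest BCE; updates the sigmoid branch only.

        Returns the loss measured before the update.
        """
        scores = self.sigmoid_scores(features)
        loss = 0.0
        for k, s in enumerate(scores):
            target = 1.0 if k == label else 0.0
            loss -= target * math.log(s) + (1.0 - target) * math.log(1.0 - s)
            grad = s - target  # derivative of BCE w.r.t. the k-th logit
            for i, f in enumerate(features):
                self.sigmoid_w[k][i] -= lr * grad * f
            self.sigmoid_b[k] -= lr * grad
        return loss


# Toy usage with made-up numbers: 3 classes, 2 pooled features.
head = DualBranchHead(weights=[[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]],
                      bias=[0.0, 0.0, 0.0])
features = [1.0, 2.0]
before = head.predict(features)
losses = [head.bce_step(features, label=1) for _ in range(50)]
after = head.predict(features)
```

Because only `sigmoid_w` and `sigmoid_b` are touched during training, `before` and `after` are identical by construction, which mirrors the claim that the frozen softmax head leaves classification accuracy unchanged.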

Paper Structure

This paper contains 46 sections, 14 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Softmax-induced distortions in CAM-based localization. (a) Additive Logit Shift: adding a constant $\delta$ to all feature weights leaves the softmax probability $y_k$ unchanged but disproportionately amplifies feature $f_i$ in the heatmap. (b) Sign Collapse: subtracting $\delta$ flips formerly positive feature weights to negative without affecting $y_k$, causing previously highlighted regions to vanish. In both cases, identical classification outputs produce drastically different localization maps.
  • Figure 2: Inference pipeline of the dual-branch sigmoid CAM. After feature extraction, the frozen softmax head predicts the class label $k^*$. In parallel, any CAM variant computes per-channel importance scores $\tilde{w}_{k^*}$ (via weights or gradients) for the sigmoid-branch output $s_{k^*}$; these scores are then rectified by clamping them to positive values. The positive-only scores are linearly combined with the feature maps to produce the final class evidence heatmap $\tilde{M}_{k^*}$.
  • Figure 3: Qualitative WSOL on ImageNet-1K for VGG-16 (top) and ResNet-50 (bottom). From left to right: input, CAM, CAM$+$Ours, Grad-CAM, Grad-CAM$+$Ours. Predicted bounding boxes are shown in green, and ground-truth boxes in red.
  • Figure 4: Additional qualitative WSOL examples on ImageNet-1K using VGG-16 (top), ResNet-50 (middle), and InceptionV3 (bottom). Predicted bounding boxes are shown in green, and ground-truth boxes in red.
  • Figure 5: Additional qualitative explanation examples on fine-grained datasets: VGG-16 on CUB-200-2011 (top) and ResNet-50 on Stanford Cars (bottom).
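The effects described in the captions of Figures 1 and 2 can be reproduced on a toy CAM. The sketch below is illustrative only: the two-channel feature maps, the weight values, $\delta = 2$, and the helper names (`class_logits`, `cam`) are all assumptions made for this example, and the `sigmoid_weights` vector merely stands in for importance scores that, in the paper, come from the trained sigmoid branch.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(weights, pooled):
    """Logit for class k is sum_i weights[k][i] * pooled[i]."""
    return [sum(w * f for w, f in zip(row, pooled)) for row in weights]


def cam(channel_weights, feature_maps, positive_only=False):
    """Weighted sum of feature maps; optionally clamp weights to >= 0 first."""
    if positive_only:
        channel_weights = [max(w, 0.0) for w in channel_weights]
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fmap[y][x] for w, fmap in zip(channel_weights, feature_maps))
             for x in range(width)]
            for y in range(height)]


# Two channels, each a 1x2 feature map (toy values).
feature_maps = [[[2.0, 0.0]],   # channel 0 fires on the left pixel
                [[0.0, 2.0]]]   # channel 1 fires on the right pixel
pooled = [1.0, 1.0]             # global average of each channel
weights = [[1.0, 0.5],          # weights[k][i]: class k, channel i
           [0.2, 0.3]]
delta = 2.0

# Figure 1(a): add delta to channel 0's weight for every class.
shifted = [[row[0] + delta, row[1]] for row in weights]
# Figure 1(b): subtract delta instead, flipping channel 0's weight negative.
collapsed = [[row[0] - delta, row[1]] for row in weights]

probs = softmax(class_logits(weights, pooled))
probs_shifted = softmax(class_logits(shifted, pooled))
probs_collapsed = softmax(class_logits(collapsed, pooled))

# Heatmaps for class 0 under the three weight settings.
map_original = cam(weights[0], feature_maps)
map_shifted = cam(shifted[0], feature_maps)
map_collapsed = cam(collapsed[0], feature_maps)

# Figure 2: importance scores are clamped to positive values before the
# linear combination. The scores below are hypothetical stand-ins.
sigmoid_weights = [0.8, -0.4]
map_rectified = cam(sigmoid_weights, feature_maps, positive_only=True)
```

All three weight settings shift every class logit by the same amount, so the softmax probabilities agree, yet the left pixel of the class-0 heatmap is amplified in `map_shifted` and turns negative in `map_collapsed`. In `map_rectified`, the channel with a negative score contributes nothing, so only positive evidence remains.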