Multi-view Feature Augmentation with Adaptive Class Activation Mapping
Xiang Gao, Yingjie Tian, Zhiquan Qi
TL;DR
This work tackles a limitation of global average pooling (GAP) in image classification: when class-relevant objects are small, background features can bias the pooled representation. It introduces AdaCAM, a label-agnostic attention mechanism that produces class-discriminative maps in a single efficient forward pass, and MV-FeaAug, which samples multiple local views around attended regions to form an ensemble of augmented features. Training combines a global auxiliary loss with a local multi-view loss, $L_{total}=L_{global}+L_{local}$, and inference aggregates predictions across all $K\times R$ sampled views via $p=\arg\max_c\sum_{i=1}^{K\times R} g_{i,c}$. Empirical results across diverse datasets and backbones show consistent improvements over GAP, with ablations confirming the contributions of both AdaCAM and feature augmentation. The approach also supports alternative classifier heads (including prototype-based ones) and can enable model compression by allowing shallower backbones to reach competitive accuracy. The practical impact is robust, data-efficient image classification that leverages attention-guided feature augmentation and multi-view ensemble predictions with minimal architectural overhead. Other key formulas are the attention map $A=\sum_{i=1}^{C} G_i f_i$ and the per-view predictions $s_i=\text{Softmax}(\text{GAP}(G_i))$.
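The ensemble decision rule in the TL;DR can be illustrated with a minimal sketch. It assumes that $g_{i,c}$ is the softmax score of class $c$ for view $i$; the source does not state whether scores or raw logits are summed. The function names and the example values are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(view_logits):
    """Sum per-view class scores and return (predicted class, summed scores).

    view_logits: list of per-view logit vectors (one per sampled view),
    each of length C. Assumption: g_{i,c} is a softmax probability.
    """
    num_classes = len(view_logits[0])
    summed = [0.0] * num_classes
    for logits in view_logits:
        probs = softmax(logits)  # per-view scores g_i
        for c, p in enumerate(probs):
            summed[c] += p
    # p = argmax_c sum_i g_{i,c}
    pred = max(range(num_classes), key=lambda c: summed[c])
    return pred, summed


# Hypothetical example: three views, two classes.
views = [[2.0, 0.0], [0.0, 0.5], [1.5, 0.0]]
pred, scores = ensemble_predict(views)
```

In this example two of the three views favor class 0, so the summed scores select class 0 even though one view disagrees. This is how a multi-view ensemble can reduce the influence of a single misleading view.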
Abstract
We propose an end-to-end-trainable feature augmentation module for image classification that extracts and exploits multi-view local features to boost model performance. Rather than using global average pooling (GAP) to extract vectorized features from the global view only, we sample and ensemble diverse multi-view local features to improve model robustness. To sample class-representative local features, we incorporate a simple auxiliary classifier head (comprising only one 1$\times$1 convolutional layer) that efficiently and adaptively attends to class-discriminative local regions of feature maps via our proposed AdaCAM (Adaptive Class Activation Mapping). Extensive experiments demonstrate consistent and noticeable performance gains from our multi-view feature augmentation module.
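As a rough illustration of how a single 1$\times$1 convolutional head could yield a label-agnostic attention map, the sketch below computes per-class activation maps and weights them by the head's own predicted class probabilities. This is one plausible reading under stated assumptions, not the paper's exact AdaCAM formulation: the weighting scheme, the function names, and the toy inputs are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conv1x1(feature_map, weights):
    """1x1 convolution without bias: a per-location linear map from D channels to C class maps.

    feature_map: D x H x W nested lists; weights: C x D nested lists.
    """
    D, H, W = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(w_c[d] * feature_map[d][y][x] for d in range(D))
              for x in range(W)] for y in range(H)] for w_c in weights]


def gap(maps):
    """Global average pooling: one scalar per map."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]


def attention_map(feature_map, weights):
    """Assumed scheme: weight each class map by its predicted probability, then sum."""
    class_maps = conv1x1(feature_map, weights)
    probs = softmax(gap(class_maps))  # no ground-truth label is used
    H, W = len(class_maps[0]), len(class_maps[0][0])
    return [[sum(p * m[y][x] for p, m in zip(probs, class_maps))
             for x in range(W)] for y in range(H)]


# Toy input: D=2 channels on a 2x2 grid, C=2 classes, identity weights.
features = [[[4.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
head = [[1.0, 0.0], [0.0, 1.0]]
attn = attention_map(features, head)
# Location with the strongest attention response.
peak = max(((y, x) for y in range(2) for x in range(2)),
           key=lambda yx: attn[yx[0]][yx[1]])
```

The peak of such a map indicates where local views might be sampled. How the paper actually selects and crops views around attended regions is not specified in this section.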
