Multi-view Feature Augmentation with Adaptive Class Activation Mapping

Xiang Gao, Yingjie Tian, Zhiquan Qi

TL;DR

This work tackles a limitation of global average pooling (GAP) in image classification: when the class-relevant object is small, background features can bias the pooled representation. It introduces AdaCAM, a label-agnostic attention mechanism that yields class-discriminative maps in a single efficient forward pass, and MV-FeaAug, which samples multiple local views around attended regions to form an ensemble of augmented features. The method combines a global auxiliary loss with a local multi-view loss, $L_{total}=L_{global}+L_{local}$, and at inference aggregates predictions across all $K\times R$ sampled views via $p=\arg\max_c\sum_{i=1}^{K\times R} g_{i,c}$. Empirical results across diverse datasets and backbones show consistent improvements over GAP, with ablations confirming the contributions of both AdaCAM and feature augmentation. The approach also supports alternative classifier heads (including prototype-based ones) and can enable model compression by allowing shallower backbones to achieve competitive accuracy. The practical impact lies in robust, data-efficient image classification that leverages attention-guided feature augmentation and multi-view ensemble predictions with minimal architectural overhead. The other key formulas are the attention map $A=\sum_{i=1}^{C} G_i f_i$ and the per-view predictions $s_i=\text{Softmax}(\text{GAP}(G_i))$.
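The ensemble decision $p=\arg\max_c\sum_i g_{i,c}$ can be illustrated with a minimal sketch. It assumes that $g_{i,c}$ is the score the classifier assigns to class $c$ for view $i$, and it takes softmax probabilities as those scores; the summary does not say whether raw logits or probabilities are summed. The function names and the example logits are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(view_scores):
    """Return argmax_c of the per-class scores summed over all views."""
    num_classes = len(view_scores[0])
    summed = [sum(view[c] for view in view_scores) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: summed[c])

# Hypothetical logits for three sampled views over three classes
# (assumption: in the paper there would be K x R such views).
view_logits = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [1.8, 0.2, 0.1]]
view_scores = [softmax(logits) for logits in view_logits]
prediction = ensemble_predict(view_scores)
print(prediction)
```

In this toy input the second view on its own favors class 1, but the summed scores favor class 0, which shows how the ensemble can override a single misleading view.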

Abstract

We propose an end-to-end trainable feature augmentation module for image classification that extracts and exploits multi-view local features to boost model performance. Unlike global average pooling (GAP), which extracts vectorized features from only the global view, our module samples and ensembles diverse multi-view local features to improve model robustness. To sample class-representative local features, we incorporate a simple auxiliary classifier head (comprising a single 1$\times$1 convolutional layer) that efficiently and adaptively attends to class-discriminative local regions of feature maps via our proposed AdaCAM (Adaptive Class Activation Mapping). Extensive experiments demonstrate consistent and noticeable performance gains from our multi-view feature augmentation module.

Paper Structure

This paper contains 19 sections, 33 equations, 22 figures, and 8 tables.

Figures (22)

  • Figure 1: Example images from the "tench", "English springer", and "golf ball" categories of the ImageNet dataset. Even within a single class, the class-related objects (annotated with red boxes) vary widely in scale. When the class-related object is small, the image representation extracted by global average pooling (GAP) can be corrupted by class-irrelevant background features and is therefore less representative of the class. This motivates us to attend to the local region containing the class-related object and to extract more class-representative image representations by sampling local features around the attended region.
  • Figure 2: Comparison between the image classification architecture with standard GAP (left) and with our MV-FeaAug module (right). Whereas GAP extracts only a global-view image representation, we sample diverse local features around the class-discriminative region of the final convolutional feature maps and use them as multi-view local image representations for ensembled classification.
  • Figure 3: Adaptive class activation mapping (AdaCAM). We replace the traditional classifier head, [GAP$\rightarrow$MLP$\rightarrow$Softmax], with [MLPConv$\rightarrow$GAP$\rightarrow$Softmax] to maintain the spatial resolution of the feature maps; MLPConv comprises consecutive Conv$_{1\times1}$ layers joined by non-linear activations. The AdaCAM is obtained as a channel-wise weighted sum of the last convolutional feature maps, with weights given by the softmax logit vector.
  • Figure 4: Overview of MV-FeaAug. We concurrently train an auxiliary classifier head consisting of a single 1$\times$1 convolutional layer to generate AdaCAM dynamically. Guided by AdaCAM, we sample multiple local representations from the final convolutional feature maps as multi-view inputs to the main classifier head. The main classifier head comprises, but is not restricted to, a single fully-connected layer.
  • Figure 5: Visual comparison of CAM and our AdaCAM in object localization, evaluated on the Imagenette validation set. Refer to the supplementary materials for more results.
  • ...and 17 more figures
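
The AdaCAM computation described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the "last convolutional feature maps" are the per-class maps produced by a single 1$\times$1 convolution, and that the weights are the softmax probabilities obtained after GAP. All function names, weights, and feature values are hypothetical.

```python
import math

def conv1x1(features, weights, biases):
    """Apply a 1x1 convolution: features [D][H][W] -> class maps [C][H][W]."""
    depth, height, width = len(features), len(features[0]), len(features[0][0])
    return [[[biases[c] + sum(weights[c][d] * features[d][y][x] for d in range(depth))
              for x in range(width)]
             for y in range(height)]
            for c in range(len(weights))]

def global_average_pool(maps):
    """Average each [H][W] map down to a single scalar per channel."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0])) for m in maps]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adacam(features, weights, biases):
    """Assumed AdaCAM: sum of class maps weighted by predicted probabilities."""
    class_maps = conv1x1(features, weights, biases)
    probs = softmax(global_average_pool(class_maps))  # no ground-truth label used
    height, width = len(class_maps[0]), len(class_maps[0][0])
    attention = [[sum(probs[c] * class_maps[c][y][x] for c in range(len(probs)))
                  for x in range(width)]
                 for y in range(height)]
    return attention, probs

# Hypothetical 2-channel, 2x2 feature maps and an identity-like 1x1 conv.
features = [[[4.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
attention, probs = adacam(features, weights, biases)
# Location of the strongest attention response, as (row, column).
peak_location = max(((y, x) for y in range(2) for x in range(2)),
                    key=lambda yx: attention[yx[0]][yx[1]])
print(peak_location)
```

Because the weights come from the model's own prediction rather than from a label, the map can be produced in the same forward pass at both training and inference time; local views would then be sampled around `peak_location`, though the sampling scheme itself is not specified in this summary.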