NOAH: Learning Pairwise Object Category Attentions for Image Classification

Chao Li, Aojun Zhou, Anbang Yao

TL;DR

This work addresses the limitation of global, location-agnostic encoding in standard classification heads by introducing NOAH, a universal head that utilizes pairwise object category attention (POCA) to generate location-specific, category-aware logits. NOAH learns POCAs at local to global scales via a two-level feature split, transform, and merge mechanism, enabling a drop-in replacement across CNN, ViT, and MLP backbones while maintaining similar parameter counts. Extensive experiments on ImageNet and MS-COCO show consistent accuracy improvements, especially for lightweight architectures, and ablations highlight the importance of spatial attention, feature splitting, and summation merging. Visualizations confirm diverse, category-specific spatial attentions learned by POCA, underscoring NOAH’s ability to capture rich spatial cues and improve generalization across tasks and training regimes.

Abstract

A modern deep neural network (DNN) for image classification tasks typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We observe that the head structures of mainstream DNNs adopt a similar feature encoding pipeline, exploiting global feature dependencies while disregarding local ones. In this paper, we revisit the feature encoding problem, and propose Non-glObal Attentive Head (NOAH) that relies on a new form of dot-product attention called pairwise object category attention (POCA), efficiently exploiting spatially dense category-specific attentions to augment classification performance. NOAH introduces a neat combination of feature split, transform and merge operations to learn POCAs at local to global scales. As a drop-in design, NOAH can be easily used to replace existing heads of various types of DNNs, improving classification performance while maintaining similar model efficiency. We validate the effectiveness of NOAH on the ImageNet classification benchmark with 25 DNN architectures spanning convolutional neural networks, vision transformers and multi-layer perceptrons. In general, NOAH is able to significantly improve the performance of lightweight DNNs, e.g., showing top-1 accuracy improvements of 3.14%, 5.3% and 1.9% for MobileNetV2 (0.5x), DeiT-Tiny (0.5x) and gMLP-Tiny (0.5x), respectively. NOAH also generalizes well when applied to medium-size and large-size DNNs. We further show that NOAH retains its efficacy on other popular multi-class and multi-label image classification benchmarks as well as in different training regimes, e.g., showing mAP improvements of 3.6% and 1.1% for the large ResNet101 and ViT-Large models, respectively, on the MS-COCO dataset. Project page: https://github.com/OSVAI/NOAH.
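To make the split, transform and merge idea more concrete, the following is a minimal NumPy sketch of how a POCA-style head might be wired, next to a conventional pooled head. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gap_head`, `poca_block`, `noah_head`), the half-and-half second-level split, the use of plain matrix products in place of 1x1 convolutions, and the softmax over spatial positions are all assumptions made for this example. The official code is on the project page.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gap_head(feat, w_fc):
    """Conventional head: global average pooling, then a linear layer.

    feat: (C, H, W) feature map, w_fc: (M, C) weights for M categories.
    """
    pooled = feat.mean(axis=(1, 2))          # (C,), all locations weighted equally
    return w_fc @ pooled                     # (M,) logits


def poca_block(feat, w_attn, w_val):
    """One POCA-style block (assumed form, see lead-in).

    feat:   (C_g, H, W) channel group handed to this block
    w_attn: (M, C_a) weights producing per-category attention maps
    w_val:  (M, C_g - C_a) weights producing per-category value maps
    """
    c_a = w_attn.shape[1]
    flat = feat.reshape(feat.shape[0], -1)   # (C_g, H*W)
    # Second-level split: first C_a channels drive attention, the rest drive values.
    attn = softmax(w_attn @ flat[:c_a], axis=1)   # (M, H*W), each row sums to 1
    val = w_val @ flat[c_a:]                      # (M, H*W), location-specific logits
    # Per-category dot product between attention and values over all locations.
    return (attn * val).sum(axis=1)               # (M,)


def noah_head(feat, blocks):
    """First-level split into len(blocks) channel groups, merged by summation."""
    groups = np.array_split(feat, len(blocks), axis=0)
    return sum(poca_block(g, wa, wv) for g, (wa, wv) in zip(groups, blocks))


# Toy sizes chosen only for the demo.
rng = np.random.default_rng(0)
C, H, W, M, N = 8, 4, 4, 5, 2
c_g = C // N          # channels per POCA block
c_a = c_g // 2        # channels used for attention inside a block
feat = rng.normal(size=(C, H, W))
blocks = [
    (rng.normal(size=(M, c_a)), rng.normal(size=(M, c_g - c_a)))
    for _ in range(N)
]
logits = noah_head(feat, blocks)
print(logits.shape)
```

In this sketch, setting the attention weights of a block to zero makes its softmax uniform, so the block reduces to global average pooling followed by a linear layer on its value channels. Under these assumptions, the pooled head can be read as the special case in which every location receives the same weight for every category.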

Paper Structure (19 sections, 4 equations, 3 figures, 17 tables)

This paper contains 19 sections, 4 equations, 3 figures, 17 tables.

Figures (3)

  • Figure 1: An architectural overview of DNN backbones appended with a Non-glObal Attentive Head (NOAH). Unlike the Popular Head based on global feature encoding, our NOAH relies on pairwise object category attentions (POCAs) learnt at local to global scales via a neat combination of feature split (two levels), transform and merge operations, taking the feature maps from the last layer of a backbone as the input.
  • Figure 2: Visualizations of NOAH: (a) illustrative visualizations of learnt attention tensors between different POCA blocks for the ground truth object category, and (b) illustrative visualizations of learnt attention tensors of the same POCA block for different object categories. We use the well-trained ResNet18 model with NOAH and image samples in the ImageNet validation set. For comparison, in (b) we also present the visualization results obtained from the well-trained ResNet18 model with the GAP-based head using Grad-CAM++ for the corresponding object categories.
  • Figure 3: Curves of top-1 training accuracy (dashed lines) and validation accuracy (solid lines) of the ResNet18/ResNet50/MobileNetV2 ($1.0\times$)/MobileNetV2 ($0.5\times$) models trained on the ImageNet dataset with the original head based on global feature encoding vs. NOAH. The ResNet18/ResNet50/MobileNetV2 ($1.0\times$)/MobileNetV2 ($0.5\times$) models with NOAH converge to the best validation accuracy, showing $1.56\%/1.02\%/1.33\%/3.14\%$ top-1 gains over the baselines, respectively, while maintaining almost the same model size.