Learning to Discover Multi-Class Attentional Regions for Multi-Label Image Recognition

Bin-Bin Gao, Hong-Yu Zhou

TL;DR

The paper tackles multi-label image recognition by introducing MCAR, a two-stream framework that learns global and local image semantics in a unified model. A lightweight multi-class attentional region module discovers a small, diverse set of class-specific regions by selecting topN class attentional maps and localizing discriminative areas via row/column marginals, requiring no extra label annotations. The two streams are jointly trained with dedicated losses and fused at inference, achieving state-of-the-art mAP on MS-COCO and PASCAL VOC with various backbones and input sizes. The approach emphasizes efficiency, parameter-free region localization, and robustness to pooling strategies and architectures, with promising implications for scalable, label-independent multi-label vision tasks.
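The selection and fusion steps mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the function names `select_top_n` and `fuse_predictions` are hypothetical, the scores are made-up numbers, and ranking classes by their global-stream score is an assumption about how the topN maps are chosen. Only the category-wise max-pooling at inference is taken directly from the paper's pipeline description.

```python
def select_top_n(global_scores, n):
    """Return the indices of the n classes with the highest global-stream scores.

    Assumption: the topN class attentional maps are the ones whose classes
    score highest in the global stream.
    """
    order = sorted(range(len(global_scores)),
                   key=lambda c: global_scores[c], reverse=True)
    return order[:n]


def fuse_predictions(global_scores, local_scores):
    """Category-wise max over the global prediction and all local-region predictions."""
    streams = [global_scores] + list(local_scores)
    return [max(s[c] for s in streams) for c in range(len(global_scores))]


# Toy example with 4 classes (all values are illustrative only).
global_scores = [0.9, 0.2, 0.6, 0.1]
top = select_top_n(global_scores, n=2)   # classes whose regions would be cropped

# One prediction vector per localized region, from the shared CNN.
local_scores = [
    [0.95, 0.1, 0.3, 0.05],
    [0.4, 0.1, 0.8, 0.7],
]
fused = fuse_predictions(global_scores, local_scores)
```

In this toy case the second region raises the score of the last class from 0.1 to 0.7, which is the kind of recovery of small or hard objects that the local stream is meant to provide.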

Abstract

Multi-label image recognition is a practical and challenging task compared to single-label image classification. However, previous works may be suboptimal because of a great number of object proposals or complex attentional region generation modules. In this paper, we propose a simple but efficient two-stream framework to recognize multi-category objects from global image to local regions, similar to how human beings perceive objects. To bridge the gap between global and local streams, we propose a multi-class attentional region module which aims to make the number of attentional regions as small as possible and keep the diversity of these regions as high as possible. Our method can efficiently and effectively recognize multi-class objects with an affordable computation cost and a parameter-free region localization module. Over three benchmarks on multi-label image classification, we create new state-of-the-art results with a single model only using image semantics without label dependency. In addition, the effectiveness of the proposed method is extensively demonstrated under different factors such as global pooling strategy, input size and network architecture. Code has been made available at https://github.com/gaobb/MCAR.

Paper Structure

This paper contains 12 sections, 15 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: The pipeline of our MCAR framework for multi-label image recognition. MCAR first feeds an input image into a deep CNN model to extract its global feature representation through the global image stream. Then, the multi-class attentional region module roughly localizes possible object regions by integrating information from the global stream. Finally, these localized regions are fed to the shared CNN to obtain their predicted class distributions through the local region stream. At the inference stage, MCAR aggregates predictions from the global and local streams with category-wise max-pooling and produces the final prediction.
  • Figure 2: Visualization of local region localization with a class attentional map. We first decompose the class attentional map into two marginal distributions, one along rows and one along columns. Then, the class attentional region is localized from these two marginal distributions.
  • Figure 3: Some examples of marginal distributions. Black curves represent the marginal distribution, the blue dashed line is the threshold $\tau$, and the best interval, between the two red dashed lines, is the desired localization.
  • Figure 4: AP (in %) of each category for our proposed framework and the ResNet-101 baseline on the MS-COCO dataset. Our MCAR has significant improvements on almost all categories, especially for some difficult categories such as "toaster" and "hair drier".
  • Figure 5: mAP comparisons of our MCAR with different values of $topN$ and $\tau$. The left three columns are based on PASCAL-VOC 2007 and the right three columns are based on the MS-COCO dataset.
  • ...and 1 more figure
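
The localization described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the helper names are hypothetical, the marginals are taken as the row-wise and column-wise maximum followed by min-max scaling, and the "best interval" is assumed to be the longest contiguous run at or above the threshold $\tau$. The step needs no learned parameters, which matches the paper's description of a parameter-free localization module.

```python
def normalize(values):
    """Min-max scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def marginals(attn_map):
    """Collapse a 2-D class attentional map into row and column marginals.

    Assumption: each marginal is the max along the other axis, then min-max scaled.
    """
    rows = [max(row) for row in attn_map]
    cols = [max(col) for col in zip(*attn_map)]
    return normalize(rows), normalize(cols)


def longest_interval(marginal, tau):
    """Return (start, end), inclusive, of the longest run with value >= tau, or None."""
    best, start = None, None
    # A sentinel below any threshold closes a run that reaches the last index.
    for i, v in enumerate(list(marginal) + [float("-inf")]):
        if v >= tau:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def localize(attn_map, tau=0.5):
    """Return (top, left, bottom, right) in attentional-map coordinates, or None."""
    rows, cols = marginals(attn_map)
    r, c = longest_interval(rows, tau), longest_interval(cols, tau)
    if r is None or c is None:
        return None
    return (r[0], c[0], r[1], c[1])


# Toy 5x5 map: a strong 2x2 blob plus an isolated weaker response in the corner.
attn = [
    [0.0, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.7, 1.0, 0.2, 0.0],
    [0.0, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6],
]
box = localize(attn, tau=0.5)
```

The returned box is expressed in attentional-map cells; in a full pipeline it would still have to be scaled to input-image coordinates before cropping the region for the local stream. Choosing the longest run keeps the main blob and ignores the isolated corner response, which is one plausible reading of the "best interval" in Figure 3.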