
Improving Weakly-Supervised Object Localization Using Adversarial Erasing and Pseudo Label

Byeongkeun Kang, Sinhae Cha, Yeejin Lee

TL;DR

Problem: localize objects under weak supervision with only image-level labels. Approach: a three-component WSOL network with adversarial erasing losses on feature maps and foreground masks, plus a pixel-level pseudo-label loss guiding background suppression and foreground activation; the total objective blends seven terms: $L = L_{cls} + \gamma_1 L_{cls\_fg} + \gamma_2 L_{ae} + \gamma_3 L_{ae\_fg} + \gamma_4 L_{pseudo} + \gamma_5 L_{bas} + \gamma_6 L_{ac}$. Key contributions: implicit full-object localization without extra inference-time branches, two complementary erasing losses, and pixel-level pseudo-label supervision that improve localization accuracy. Findings: the method achieves state-of-the-art localization across ILSVRC-2012, CUB-200-2011, and PASCAL VOC 2012 for two backbones, with ablations confirming the contribution of each loss and the advantage of higher-resolution shared features. Significance: provides a practical, end-to-end WSOL framework that better suppresses backgrounds and covers the full object, enabling robust localization in weakly-supervised settings.
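
The seven-term objective above is a plain weighted sum, which the following minimal sketch makes concrete. The dictionary keys mirror the subscripts in the formula; the loss values and the uniform weight of 0.5 are placeholders chosen for illustration and are not taken from the paper.

```python
from typing import Mapping

# The six weighted terms, in the order gamma_1 .. gamma_6 of the formula.
AUX_TERMS = ("cls_fg", "ae", "ae_fg", "pseudo", "bas", "ac")


def total_loss(losses: Mapping[str, float], gammas: Mapping[str, float]) -> float:
    """Return L = L_cls + sum over k of gamma_k * L_k."""
    total = losses["cls"]  # the classification loss carries no weight
    for name in AUX_TERMS:
        total += gammas[name] * losses[name]
    return total


# Placeholder values (assumptions for illustration only).
example_losses = {
    "cls": 1.0, "cls_fg": 2.0, "ae": 0.5, "ae_fg": 0.5,
    "pseudo": 1.0, "bas": 4.0, "ac": 2.0,
}
example_gammas = {name: 0.5 for name in AUX_TERMS}

print(total_loss(example_losses, example_gammas))  # prints 6.0
```

Setting a weight to zero removes the corresponding term, which is one simple way to reproduce the kind of per-loss ablation the summary mentions.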

Abstract

Weakly-supervised learning approaches have gained significant attention due to their ability to reduce the effort required for human annotations in training neural networks. This paper investigates a framework for weakly-supervised object localization, which aims to train a neural network capable of predicting both the object class and its location using only images and their image-level class labels. The proposed framework consists of a shared feature extractor, a classifier, and a localizer. The localizer predicts pixel-level class probabilities, while the classifier predicts the object class at the image level. Since image-level class labels are insufficient for training the localizer, weakly-supervised object localization methods often encounter challenges in accurately localizing the entire object region. To address this issue, the proposed method incorporates adversarial erasing and pseudo labels to improve localization accuracy. Specifically, novel losses are designed to utilize adversarially erased foreground features and adversarially erased feature maps, reducing dependence on the most discriminative region. Additionally, the proposed method employs pseudo labels to suppress activation values in the background while increasing them in the foreground. The proposed method is applied to two backbone networks (MobileNetV1 and InceptionV3) and is evaluated on three publicly available datasets (ILSVRC-2012, CUB-200-2011, and PASCAL VOC 2012). The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods across all evaluated metrics.
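
The two ideas named in the abstract, erasing the most discriminative region and deriving pixel-level pseudo labels from activations, can be illustrated with a small sketch. Everything specific here is an assumption: the single-channel feature grid, the thresholding rules, the threshold values, and the function names `erase_discriminative` and `pseudo_labels` are hypothetical and are not the paper's definitions.

```python
from typing import List

Grid = List[List[float]]


def erase_discriminative(features: Grid, activation: Grid, threshold: float) -> Grid:
    """Zero the feature values wherever the activation exceeds `threshold`.

    Assumed erasing rule: highly activated pixels are treated as the most
    discriminative region and removed, so a later loss has to rely on the
    remaining object parts.
    """
    return [
        [0.0 if a > threshold else f for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(features, activation)
    ]


def pseudo_labels(activation: Grid, low: float, high: float) -> List[List[int]]:
    """Assumed labeling rule: 1 = foreground, 0 = background, -1 = ignored."""
    def label(a: float) -> int:
        if a >= high:
            return 1
        if a <= low:
            return 0
        return -1  # uncertain pixels contribute no supervision
    return [[label(a) for a in row] for row in activation]


activation = [[0.9, 0.6], [0.2, 0.05]]
features = [[1.0, 2.0], [3.0, 4.0]]

print(erase_discriminative(features, activation, threshold=0.8))
# prints [[0.0, 2.0], [3.0, 4.0]]
print(pseudo_labels(activation, low=0.1, high=0.5))
# prints [[1, 1], [-1, 0]]
```

In this toy case only the top-left pixel is erased, while the pseudo labels mark the top row as foreground and the bottom-right pixel as background.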

Paper Structure (10 sections, 13 equations, 12 figures, 12 tables)


Figures (12)

  • Figure 1: Illustration of weakly-supervised object localization.
  • Figure 2: Illustration of the two proposed adversarial erasing-based losses and the pseudo-label-based loss. Although adversarial erasing is applied to feature maps from the classifier rather than to the image itself, the erased regions are overlaid on the image in cyan for better visibility. The loss on the rightmost side is computed using the feature map in which discriminative regions are erased. The loss on the leftmost side uses features only from the foreground region, with erasing, to train the localizer. Pixel-level pseudo labels are generated and used in an auxiliary loss to suppress the background while activating the entire object region.
  • Figure 3: Illustration of the proposed framework. Given an input image, the framework predicts an image-level class label and a pixel-level activation map, which can be interpreted as the probability that a specific object is present at each pixel. A feature extractor first encodes the image into a feature representation shared by classification and localization. The resulting feature map is then processed separately by a classifier and a localizer.
  • Figure 4: The proposed framework during inference. $E_f$, $E_c$, and $E_l$ represent a feature extractor, a classifier, and a localizer, respectively. $\boldsymbol{I}$, $\boldsymbol{F}^{f}$, $\boldsymbol{F}^{c}$, $\boldsymbol{F}^{cam}$, and $\boldsymbol{F}^{fg}$ denote an input image, a feature map, a score map, a class activation map, and a foreground mask, respectively. $\boldsymbol{p}$ represents a probability vector for classification. GAP, $\sigma(\cdot)$, and slice denote a global average pooling layer, a softmax function, and the extraction of a channel from a tensor, respectively.
  • Figure 5: The proposed framework during training. $E_f$, $E_c$, and $E_l$ represent a feature extractor, a classifier, and a localizer, respectively. $\boldsymbol{I}$, $\boldsymbol{F}^{f}$, $\boldsymbol{F}^{c}$, $\boldsymbol{F}^{cam}$, and $\boldsymbol{F}^{fg}$ denote an input image, a feature map, a score map, a class activation map, and a foreground mask, respectively. $\boldsymbol{p}$ represents a probability vector for classification. GAP, $\sigma(\cdot)$, and slice denote a global average pooling layer, a softmax function, and the extraction of a channel from a tensor, respectively.
  • ...and 7 more figures
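
The three operations named in the Figure 4 caption (GAP, softmax, and slice) can be sketched on plain nested lists. The wiring is an assumption made for illustration: the score map is pooled and normalized to obtain $\boldsymbol{p}$, and the channel of the predicted class is sliced from the class activation map. The networks $E_f$, $E_c$, and $E_l$ are not modeled; their outputs are passed in directly, and the function name `infer` is hypothetical.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]
Tensor3 = List[Map2D]  # indexed as [channel][row][col]


def global_average_pool(score_map: Tensor3) -> List[float]:
    """GAP: average each channel over all spatial positions."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in score_map
    ]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def infer(score_map: Tensor3, cam: Tensor3) -> Tuple[int, List[float], Map2D]:
    """Return the predicted class, the probability vector p, and the sliced map."""
    p = softmax(global_average_pool(score_map))
    predicted = max(range(len(p)), key=p.__getitem__)
    return predicted, p, cam[predicted]  # slice: take one channel


# Toy tensors with two classes on a 2x2 grid (placeholder values).
score_map = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 3.0], [2.0, 2.0]]]
cam = [[[0.1, 0.1], [0.1, 0.1]], [[0.9, 0.8], [0.2, 0.1]]]

predicted, p, localization = infer(score_map, cam)
print(predicted)      # prints 1
print(localization)   # prints [[0.9, 0.8], [0.2, 0.1]]
```

Because the sliced channel is selected by the classifier's prediction, localization in this sketch needs no additional branch beyond the components named in the caption.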