Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond

Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, Yong Jae Lee

TL;DR

Hide-and-Seek (HaS) introduces a patch-based occlusion data augmentation that hides random patches during training to force networks to use multiple object parts, improving weakly-supervised localization and robustness to occlusion. Hidden pixels are filled with the dataset mean $\mu$ to align the training and testing distributions, and the approach extends to videos via temporal patch hiding. Extensive experiments across object localization, semantic segmentation, temporal action localization, and supervised tasks demonstrate consistent gains across architectures and datasets, highlighting HaS's broad applicability. The work provides practical guidance on patch sizes and hiding probabilities and releases code and models on its project page.

Abstract

We propose 'Hide-and-Seek', a general-purpose data augmentation technique, which is complementary to existing data augmentation techniques and is beneficial for various visual recognition tasks. The key idea is to hide patches in a training image randomly, in order to force the network to seek other relevant content when the most discriminative content is hidden. Our approach only needs to modify the input image and can work with any network to improve its performance. During testing, it does not need to hide any patches. The main advantage of Hide-and-Seek over existing data augmentation techniques is its ability to improve object localization accuracy in the weakly-supervised setting, and we therefore use this task to motivate the approach. However, Hide-and-Seek is not tied only to the image localization task, and can generalize to other forms of visual input like videos, as well as other recognition tasks like image classification, temporal action localization, semantic segmentation, emotion recognition, age/gender estimation, and person re-identification. We perform extensive experiments to showcase the advantage of Hide-and-Seek on these various visual recognition problems.

Paper Structure

This paper contains 39 sections, 1 equation, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Main idea. (Top row) A deep network tends to focus on the most discriminative parts of an image (e.g., face of the dog) for classification. (Bottom row) By hiding image patches randomly, we can force the network to focus on other relevant object parts in order to correctly classify the image as 'dog'.
  • Figure 2: Approach overview. Left: For each training image, we divide it into a grid of $S \times S$ patches. Each patch is then randomly hidden with probability $p_{hide}$ and given as input to a CNN. The hidden patches change randomly across different epochs. Right: During testing, the full image without any hidden patches is given as input to the trained network, which produces, e.g., a classification label and an object localization heatmap.
  • Figure 3: There are three types of convolutional filter activations after hiding patches: a convolution filter can be completely within a visible region (blue box), completely within a hidden region (red box), or partially within a visible/hidden region (green box).
  • Figure 4: Conv1 filters of AlexNet trained with Hide-and-Seek on ImageNet. Hiding image patches does not introduce any noticeable artifacts in the learned filters.
  • Figure 5: Qualitative object localization results. We compare our approach with AlexNet-GAP [zhou-cvpr2016] on the ILSVRC validation data. For each image, we show the bounding box and CAM obtained by AlexNet-GAP (left) and our method (right). Our Hide-and-Seek approach localizes multiple relevant parts of an object, whereas AlexNet-GAP mainly focuses on the most discriminative part.
  • ...and 3 more figures
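
The training-time step described in the Figure 2 caption (divide the image into an $S \times S$ grid, hide each cell independently with probability $p_{hide}$, fill hidden pixels with the dataset mean) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `hide_patches`, its parameters, and the use of `np.linspace` to place cell boundaries are hypothetical choices made for this example, and the image's own per-channel mean is used only as a stand-in for the dataset mean.

```python
import numpy as np


def hide_patches(image, grid_size, p_hide, fill_value, rng):
    """Return (augmented copy of `image`, boolean grid of hidden cells).

    image:      H x W x C float array holding one training image.
    grid_size:  S, the number of cells per side of the S x S grid.
    p_hide:     probability of hiding each cell, drawn independently.
    fill_value: value written into hidden pixels (scalar or per-channel),
                e.g. the dataset mean.
    rng:        numpy.random.Generator, passed in for reproducibility.
    """
    out = image.copy()  # leave the caller's array untouched
    height, width = image.shape[:2]
    # Cell boundaries; linspace also covers sizes not divisible by S.
    rows = np.linspace(0, height, grid_size + 1).astype(int)
    cols = np.linspace(0, width, grid_size + 1).astype(int)
    # One independent Bernoulli(p_hide) draw per grid cell.
    hidden = rng.random((grid_size, grid_size)) < p_hide
    for i in range(grid_size):
        for j in range(grid_size):
            if hidden[i, j]:
                out[rows[i]:rows[i + 1], cols[j]:cols[j + 1]] = fill_value
    return out, hidden


# Example call with made-up inputs (assumed values, not from the paper).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
stand_in_mean = image.mean(axis=(0, 1))  # placeholder for the dataset mean
augmented, hidden = hide_patches(
    image, grid_size=4, p_hide=0.5, fill_value=stand_in_mean, rng=rng
)
```

Calling the function again for the same image in each epoch draws a fresh mask, which matches the caption's note that hidden patches change across epochs. At test time the function is simply not applied, so the network sees the full image.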