Training Region-based Object Detectors with Online Hard Example Mining

Abhinav Shrivastava, Abhinav Gupta, Ross Girshick

TL;DR

This work tackles the training inefficiency caused by the extreme imbalance between background and object regions in region-based detectors. It introduces Online Hard Example Mining (OHEM), an online, loss-driven sampling strategy that selects the hardest RoIs within each image for backpropagation, replacing several manual heuristics. OHEM yields consistent improvements on standard benchmarks (PASCAL VOC 2007/2012 and MS COCO) and is complementary to other advances such as multi-scale training and iterative bounding-box regression; combined with them, it reaches state-of-the-art results of 78.9% and 76.3% mAP on VOC 2007 and 2012 respectively. The approach is simple to integrate with existing region-based detectors, and its benefit grows on larger, more difficult datasets, so it improves detection accuracy without extensive hyperparameter tuning.

Abstract

The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors. Our motivation is the same as it has always been -- detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. OHEM is a simple and intuitive algorithm that eliminates several heuristics and hyperparameters in common use. But more importantly, it yields consistent and significant boosts in detection performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness increases as datasets become larger and more difficult, as demonstrated by the results on the MS COCO dataset. Moreover, combined with complementary advances in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on PASCAL VOC 2007 and 2012 respectively.

Paper Structure

This paper contains 32 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Architecture of the Fast R-CNN approach (see Section \ref{sec:frcn-overview-sec} for details).
  • Figure 2: Architecture of the proposed training algorithm. Given an image and selective search RoIs, the conv network computes a conv feature map. In (a), the read-only RoI network runs a forward pass on the feature map and all RoIs (shown in green arrows). Then the Hard RoI module uses these RoI losses to select $B$ examples. In (b), these hard examples are used by the RoI network to compute forward and backward passes (shown in red arrows).
  • Figure 3: Training loss is computed for various training procedures using VGG16 networks discussed in Section \ref{sec:analyze}. We report mean loss per RoI. These results indicate that using hard mining for training leads to lower training loss than any of the other heuristics.
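
The selection step described in the Figure 2 caption (compute a loss for every RoI, then keep the $B$ hardest for the forward-backward pass) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_hard_rois`, its parameters, and the example loss values are all assumptions made for this sketch, and the per-RoI losses are taken as given rather than computed by a network.

```python
from typing import List, Sequence


def select_hard_rois(roi_losses: Sequence[float], batch_size: int) -> List[int]:
    """Return the indices of the `batch_size` RoIs with the highest loss.

    Hypothetical helper: `roi_losses` stands in for the per-RoI losses
    produced by the read-only forward pass over all RoIs.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Rank RoI indices by descending loss; ties keep their original order
    # because Python's sort is stable.
    ranked = sorted(range(len(roi_losses)), key=lambda i: -roi_losses[i])
    # Keep only the B hardest examples (or all of them if fewer exist).
    return ranked[:batch_size]


# Made-up losses for five RoIs from one image.
losses = [0.05, 2.30, 0.01, 1.10, 0.70]
hard = select_hard_rois(losses, batch_size=2)
print(hard)  # [1, 3]
```

In a real training loop, only the RoIs at the returned indices would be passed to the second RoI network for the forward and backward passes, so the gradient is driven by the examples the current model handles worst.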