Light-Head R-CNN: In Defense of Two-Stage Object Detector

Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun

TL;DR

The paper addresses the speed-accuracy conflict in two-stage object detectors by proposing Light-Head R-CNN, which replaces heavy RoI heads with thin feature maps produced by large-kernel separable convolutions and a lightweight R-CNN subnetwork. This design enables fast per-RoI predictions while preserving high localization and classification performance across backbones, including ResNet-101 and a compact Xception-like network. On COCO, it achieves a single-scale mmAP of around 40.8% and up to 41.5% with multi-scale training, surpassing many single-stage and traditional two-stage detectors, and reaches 102 FPS with a tiny backbone. The work demonstrates that a lighter head can unlock both speed and accuracy for two-stage detectors, offering practical deployment advantages and suggesting further optimizations via RoI pooling refinements and backbone scaling.

Abstract

In this paper, we first investigate why typical two-stage methods are not as fast as single-stage, fast detectors like YOLO and SSD. We find that Faster R-CNN and R-FCN perform intensive computation after or before RoI warping. Faster R-CNN involves two fully connected layers for RoI recognition, while R-FCN produces large score maps. Thus, these networks are slow because of the heavy-head design in the architecture. Even if we significantly reduce the base model, the computation cost cannot be decreased accordingly. We propose a new two-stage detector, Light-Head R-CNN, to address this shortcoming in current two-stage approaches. In our design, we make the head of the network as light as possible by using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). Our ResNet-101 based Light-Head R-CNN outperforms state-of-the-art object detectors on COCO while keeping time efficiency. More importantly, simply replacing the backbone with a tiny network (e.g., Xception), our Light-Head R-CNN gets 30.7 mmAP at 102 FPS on COCO, significantly outperforming single-stage, fast detectors like YOLO and SSD on both speed and accuracy. Code will be made publicly available.
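
To make the "large score maps" versus "thin feature map" contrast concrete, the following is a rough sketch of the channel arithmetic for a position-sensitive map. The abstract gives no sizes, so the 7x7 pooling grid, the 81 COCO classes (80 plus background), and the 10 channel groups for the thin map are assumed values chosen for illustration, and `score_map_channels` is a hypothetical helper name.

```python
def score_map_channels(groups: int, pool_size: int) -> int:
    """Channels in a position-sensitive map: one group per pool_size x pool_size bin."""
    return groups * pool_size * pool_size


# Assumed values for illustration; none of them are stated in the abstract.
POOL_SIZE = 7      # p x p bins per RoI
NUM_CLASSES = 81   # COCO categories plus background
THIN_GROUPS = 10   # small, class-independent number of channel groups

heavy_channels = score_map_channels(NUM_CLASSES, POOL_SIZE)
thin_channels = score_map_channels(THIN_GROUPS, POOL_SIZE)

print(f"class-wise score map: {heavy_channels} channels")
print(f"thin feature map:     {thin_channels} channels")
print(f"reduction factor:     {heavy_channels / thin_channels:.1f}x")
```

Under these assumptions the map feeding RoI pooling no longer grows with the number of classes, which is why a thin map can be paired with a small fully-connected layer for the final prediction.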

Paper Structure

This paper contains 17 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of Light-Head R-CNN with previous one-stage and two-stage detectors. We show our results with different backbones (a small Xception-like network, ResNet-50, ResNet-101). Thanks to better design principles, our Light-Head R-CNN significantly outperforms all competitors and provides a new upper envelope. Note that all results reported here are obtained using single-scale training only. The multi-scale training results are presented in Table \ref{table:COCO_methods_comparison}.
  • Figure 2: Overview of our approach. Our Light-Head R-CNN builds "thin" feature maps before RoI warping using large separable convolution. We adopt a single fully-connected layer with 2048 channels in our R-CNN subnet. Thanks to the thinner feature maps and the cheap R-CNN subnet, the whole network is highly efficient while maintaining accuracy.
  • Figure 3: Large separable convolution performs a $k\times 1$ and $1\times k$ convolution sequentially. The computational complexity can be further controlled through $C_{mid}, C_{out}$.
  • Figure 4: The network used to evaluate the impact of thin feature maps. We keep the network the same as R-FCN, except that we reduce the number of feature map channels used for PSRoI pooling and add additional fully-connected layers for the final prediction.
  • Figure 5: Representative results of our large "L" model.
  • ...and 1 more figure
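
The Figure 3 caption says the cost of the large separable convolution is controlled through $C_{mid}$ and $C_{out}$. The sketch below illustrates that by counting weights (bias terms ignored) for a dense $k\times k$ convolution and for a single $k\times 1$ followed by $1\times k$ path. The function names and the example sizes ($k=15$, 2048 input channels, $C_{mid}=256$, $C_{out}=490$) are assumptions for illustration and are not taken from the captions above.

```python
def dense_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def separable_conv_weights(k: int, c_in: int, c_mid: int, c_out: int) -> int:
    """Weights in a k x 1 conv (c_in -> c_mid) followed by a 1 x k conv (c_mid -> c_out)."""
    first = k * c_in * c_mid    # k x 1 kernel
    second = k * c_mid * c_out  # 1 x k kernel
    return first + second


# Illustrative sizes only; these are assumptions, not values given in this excerpt.
K, C_IN, C_MID, C_OUT = 15, 2048, 256, 490

dense = dense_conv_weights(K, C_IN, C_OUT)
separable = separable_conv_weights(K, C_IN, C_MID, C_OUT)

print(f"dense {K}x{K}:      {dense:,} weights")
print(f"separable path: {separable:,} weights")
print(f"ratio:          {dense / separable:.1f}x")

# Shrinking C_mid lowers the cost of both stages of the separable path.
for c_mid in (64, 128, 256):
    print(c_mid, f"{separable_conv_weights(K, C_IN, c_mid, C_OUT):,}")
```

The same counts scale with the spatial size of the feature map to give multiply-accumulate costs, so the ratio between the two designs is unchanged. If a design sums several such paths, the separable cost is multiplied by the number of paths.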