R-FCN: Object Detection via Region-based Fully Convolutional Networks

Jifeng Dai, Yi Li, Kaiming He, Jian Sun

TL;DR

The paper introduces Region-based Fully Convolutional Networks (R-FCN), an object detector that shares almost all computation across the whole image by combining a fully convolutional backbone with position-sensitive score maps. Position-sensitive RoI pooling encodes spatial object information without a heavy per-RoI subnetwork, enabling end-to-end training and making R-FCN 2.5-20x faster than its Faster R-CNN counterpart. With ResNet-101, R-FCN reaches 83.6% mAP on the PASCAL VOC 2007 test set (training data 07+12+COCO) at about 170 ms per image on one Nvidia K40 GPU, and 31.5% AP on MS COCO test-dev. The work shows that fully convolutional image classifiers can be repurposed for precise object localization with minimal per-region overhead, and the code is publicly available.

Abstract

We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code is made publicly available at: https://github.com/daijifeng001/r-fcn

Paper Structure

This paper contains 7 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Key idea of R-FCN for object detection. In this illustration, there are $k \times k = 3 \times 3$ position-sensitive score maps generated by a fully convolutional network. For each of the $k \times k$ bins in an RoI, pooling is only performed on one of the $k^2$ maps (marked by different colors).
  • Figure 2: Overall architecture of R-FCN. A Region Proposal Network (RPN) (Ren et al., 2015) proposes candidate RoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible.
  • Figure 3: Visualization of R-FCN ($k \times k = 3 \times 3$) for the person category.
  • Figure 5: Curated examples of R-FCN results on the PASCAL VOC 2007 test set (83.6% mAP). The network is ResNet-101, and the training data is 07+12+COCO. A score threshold of 0.6 is used for displaying. The running time per image is 170ms on one Nvidia K40 GPU.
  • Figure 6: Curated examples of R-FCN results on the MS COCO test-dev set (31.5% AP). The network is ResNet-101, and the training data is MS COCO trainval. A score threshold of 0.6 is used for displaying.
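
The pooling described in the Figure 1 caption, where each of the $k \times k$ bins of an RoI reads from only one of the $k^2$ score maps, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names (`ps_roi_pool`, `softmax`), the channel layout, the extra background category, the integer RoI given in feature-map cells, the floor/ceil bin boundaries, and the choice to average the bin responses into one score per category are all assumptions made for illustration.

```python
import math


def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling followed by average voting (sketch).

    score_maps[ch][y][x] holds k*k*(num_classes+1) channels. The layout
    ch = (i*k + j) * (num_classes+1) + c is an assumption of this sketch:
    bin (row i, column j) owns one map per category c (index 0 = background).
    roi = (x0, y0, w, h) in integer feature-map cells, with w >= 1 and h >= 1.
    """
    x0, y0, w, h = roi
    n_cat = num_classes + 1
    votes = [0.0] * n_cat

    def ceil_div(a, b):
        return -(-a // b)

    for i in range(k):  # bin row
        y_lo = y0 + (i * h) // k
        y_hi = y0 + ceil_div((i + 1) * h, k)
        for j in range(k):  # bin column
            x_lo = x0 + (j * w) // k
            x_hi = x0 + ceil_div((j + 1) * w, k)
            n = (y_hi - y_lo) * (x_hi - x_lo)  # cells in this bin
            for c in range(n_cat):
                # Only the map dedicated to bin (i, j) is read here.
                plane = score_maps[(i * k + j) * n_cat + c]
                total = sum(
                    plane[y][x]
                    for y in range(y_lo, y_hi)
                    for x in range(x_lo, x_hi)
                )
                votes[c] += total / n  # average pooling inside the bin
    # Average the k*k bin responses into one score per category.
    return [v / (k * k) for v in votes]


def softmax(scores):
    """Numerically stable softmax over the per-category scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Toy input: k = 2, one object class plus background, 4x4 feature map.
# Each channel is filled with its own index so the result is easy to check.
k, num_classes = 2, 1
n_cat = num_classes + 1
score_maps = [[[float(ch)] * 4 for _ in range(4)] for ch in range(k * k * n_cat)]
scores = ps_roi_pool(score_maps, roi=(0, 0, 4, 4), k=k, num_classes=num_classes)
probs = softmax(scores)
print(scores, probs)
```

Because the only per-RoI work is this pooling and voting over precomputed maps, with no learnable layers applied per region, the sketch is consistent with the Figure 2 remark that the per-RoI computational cost is negligible.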