Speed/accuracy trade-offs for modern convolutional object detectors

Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

TL;DR

The paper systematically characterizes speed, memory, and accuracy trade-offs across three modern object-detection meta-architectures (SSD, Faster R-CNN, R-FCN) using a unified TensorFlow framework and a diverse set of feature extractors. By varying input sizes, number of proposals, and network backbones, it identifies practical sweet spots and shows that faster detectors with fewer proposals can approach the accuracy of slower, more complex models. It also quantifies the impact of object size and input resolution on performance and demonstrates state-of-the-art single-model COCO results alongside strong ensemble performance. The findings provide concrete guidance for deploying object detectors in real-world applications with tight latency and memory constraints.

Abstract

The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
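
One way to read a speed/accuracy trade-off curve of the kind the abstract describes is as a Pareto frontier: the set of detectors for which no alternative is both faster and more accurate. The following is a minimal sketch of that idea, not the authors' code. The function name `pareto_frontier`, the model names, and all timing and mAP numbers are hypothetical placeholders and are not results from the paper.

```python
from typing import List, Tuple

# (name, seconds per image, mAP). All values below are made up for illustration.
Detector = Tuple[str, float, float]


def pareto_frontier(points: List[Detector]) -> List[Detector]:
    """Return detectors that are not dominated in (time, accuracy).

    A detector is dropped if some other detector is at least as fast
    and at least as accurate.
    """
    # Sort by time ascending; for equal times, put the more accurate one first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier: List[Detector] = []
    best_map = float("-inf")
    for p in ordered:
        # Everything already seen is at least as fast, so p survives
        # only if it is strictly more accurate than all of them.
        if p[2] > best_map:
            frontier.append(p)
            best_map = p[2]
    return frontier


# Hypothetical candidates, e.g. different meta-architecture and
# feature-extractor combinations measured on the same hardware.
candidates = [
    ("model_a", 0.03, 0.19),
    ("model_b", 0.05, 0.18),
    ("model_c", 0.10, 0.30),
    ("model_d", 0.25, 0.29),
    ("model_e", 0.60, 0.35),
]

frontier = pareto_frontier(candidates)
for name, seconds, mean_ap in frontier:
    print(f"{name}: {seconds * 1000:.0f} ms/image, mAP {mean_ap:.2f}")
```

With a latency budget in hand, one would then pick the most accurate frontier entry whose time fits the budget. Memory could be added as a third criterion in the same way, at the cost of a slightly more involved dominance check.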

Paper Structure

This paper contains 36 sections, 1 equation, 17 figures, 6 tables.

Figures (17)

  • Figure 1: High level diagrams of the detection meta-architectures compared in this paper.
  • Figure 2: Accuracy vs time, with marker shapes indicating meta-architecture and colors indicating feature extractor. Each (meta-architecture, feature extractor) pair can correspond to multiple points on this plot due to changing input sizes, stride, etc.
  • Figure 3: Accuracy of detector (mAP on COCO) vs accuracy of feature extractor (as measured by top-1 accuracy on ImageNet-CLS). To avoid crowding the plot, we show only the low resolution models.
  • Figure 4: Accuracy stratified by object size, meta-architecture and feature extractor. We fix the image resolution to 300.
  • Figure 5: Effect of image resolution.
  • ...and 12 more figures