EfficientDet: Scalable and Efficient Object Detection

Mingxing Tan, Ruoming Pang, Quoc V. Le

TL;DR

EfficientDet tackles the challenge of high-accuracy object detection under diverse resource limits by introducing BiFPN for efficient multiscale feature fusion and a compound scaling method that jointly scales backbone, feature network, heads, and input resolution. By pairing EfficientNet backbones with BiFPN and shared prediction heads, the authors construct a family of detectors that achieve state-of-the-art COCO AP with far fewer parameters and FLOPs. An extensive ablation study confirms the value of bidirectional cross-scale connections, weighted (and fast normalized) fusion, and the compound scaling strategy. Practically, EfficientDet delivers substantial speedups and efficiency gains across CPU and GPU hardware while maintaining high accuracy, enabling deployment in mobile and edge settings as well as datacenters.

Abstract

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multiscale feature fusion. Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and better backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieves much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with a single model and a single scale, our EfficientDet-D7 achieves state-of-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs, being 4x - 9x smaller and using 13x - 42x fewer FLOPs than previous detectors. Code is available at https://github.com/google/automl/tree/master/efficientdet.
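
To make the compound scaling idea concrete, the sketch below derives the feature-network, head, and input dimensions from a single coefficient phi. The constants (base width 64 with growth factor 1.35, base depth 3, base resolution 512 with step 128) are recalled from the paper's scaling equations rather than stated in this excerpt, so treat them as assumptions to check against the paper's scaling-configuration table. The function name `compound_scaling` and the dictionary keys are hypothetical. The published configurations round the width and depart from these formulas for the largest models, and the backbone is scaled separately by choosing the matching EfficientNet variant.

```python
def compound_scaling(phi: int) -> dict:
    """Illustrative detector dimensions for a compound coefficient phi.

    Hypothetical helper: the constants are assumptions recalled from the
    paper's scaling equations, not values taken from the released code.
    """
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return {
        # Input resolution grows linearly with phi.
        "input_resolution": 512 + 128 * phi,
        # BiFPN channel width grows exponentially (unrounded here).
        "bifpn_width": 64 * 1.35 ** phi,
        # BiFPN depth (number of repeated layers) grows linearly.
        "bifpn_depth": 3 + phi,
        # Box/class head depth grows once every three steps of phi.
        "head_depth": 3 + phi // 3,
    }


if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scaling(phi))
```

Driving every dimension from one coefficient keeps the search space one-dimensional: moving to a larger detector means incrementing phi rather than tuning width, depth, and resolution independently.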

Paper Structure

This paper contains 27 sections, 5 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Model FLOPs vs. COCO accuracy -- All numbers are for single-model single-scale. Our EfficientDet achieves new state-of-the-art 55.1% COCO AP with far fewer parameters and FLOPs than previous detectors. More studies on different backbones and FPN/NAS-FPN/BiFPN are in Table \ref{tab:backbonefpn} and Table \ref{tab:bifpncompare}. Complete results are in Table \ref{tab:coco}.
  • Figure 2: Feature network design -- (a) FPN [fpn17] introduces a top-down pathway to fuse multi-scale features from level 3 to 7 ($P_3$ - $P_7$); (b) PANet [panet18] adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN [nasfpn19] uses neural architecture search to find an irregular feature network topology and then repeatedly applies the same block; (d) is our BiFPN with better accuracy and efficiency trade-offs.
  • Figure 3: EfficientDet architecture -- It employs EfficientNet [efficientnet19] as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both BiFPN layers and class/box net layers are repeated multiple times based on different resource constraints, as shown in Table \ref{tab:scaleconfigs}.
  • Figure 4: Model size and inference latency comparison -- Latency is measured with batch size 1 on the same machine equipped with a Titan V GPU and Xeon CPU. AN denotes AmoebaNet + NAS-FPN trained with auto-augmentation [odaa19]. Our EfficientDet models are 4x - 9x smaller, 2x - 4x faster on GPU, and 5x - 11x faster on CPU than other detectors.
  • Figure 5: Softmax vs. fast normalized feature fusion -- (a) - (c) show normalized weights (i.e., importance) during training for three representative nodes; each node has two inputs (input1 & input2), and their normalized weights always sum to 1.
  • ...and 4 more figures
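
The two normalization schemes compared in the Figure 5 caption can be sketched as follows. This is a minimal pure-Python illustration, not the released implementation. The function names are hypothetical, and the small constant `eps = 1e-4` together with the ReLU clipping of raw weights are assumptions recalled from the paper's description of fast normalized fusion.

```python
import math


def softmax_weights(raw):
    """Softmax normalization: exp(w_i) / sum_j exp(w_j)."""
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in raw]
    total = sum(exps)
    return [e / total for e in exps]


def fast_normalized_weights(raw, eps=1e-4):
    """Fast normalization: relu(w_i) / (eps + sum_j relu(w_j)).

    eps is an assumed small constant that avoids division by zero.
    """
    clipped = [max(0.0, w) for w in raw]  # ReLU keeps weights non-negative
    total = eps + sum(clipped)
    return [c / total for c in clipped]


def fuse(features, weights):
    """Weighted element-wise sum of equally sized feature vectors."""
    return [
        sum(w * f[k] for w, f in zip(weights, features))
        for k in range(len(features[0]))
    ]


# A node with two inputs, as in the figure (toy values, not real features).
raw = [0.8, 0.2]
input1 = [1.0, 2.0]
input2 = [3.0, 4.0]

soft = softmax_weights(raw)
fast = fast_normalized_weights(raw)
fused = fuse([input1, input2], fast)

if __name__ == "__main__":
    print("softmax weights:", soft)
    print("fast normalized weights:", fast)
    print("fused feature:", fused)
```

Both schemes yield non-negative weights. Softmax weights sum to exactly 1, while the fast variant sums to 1 only up to the small `eps` term and avoids the exponentials that make softmax slower on accelerators.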