NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, Quoc V. Le

TL;DR

The paper replaces manual design of feature pyramids for multi-scale object detection with NAS-FPN, a neural architecture search framework that learns cross-scale fusion patterns. It defines a modular, repeatable merging-cell search space and uses a PPO-based controller with a proxy task to discover NAS-FPN architectures that can be stacked and adapted to different backbones. Empirical results show NAS-FPN achieves better accuracy/latency tradeoffs across backbones, including 48.3 AP with AmoebaNet-D and strong mobile performance with NAS-FPNLite, surpassing Mask R-CNN detection accuracy with less computation. The work also demonstrates the potential for anytime detection via deep supervision and improves regularization with DropBlock, highlighting NAS-FPN as a versatile, scalable approach to object detection.

Abstract

Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales. NAS-FPN, combined with various backbone models in the RetinaNet framework, achieves better accuracy and latency tradeoff compared to state-of-the-art object detection models. NAS-FPN improves mobile detection accuracy by 2 AP compared to state-of-the-art SSDLite with MobileNetV2 model in [32] and achieves 48.3 AP which surpasses Mask R-CNN [10] detection accuracy with less computation time.

Paper Structure

This paper contains 24 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Average Precision vs. inference time per image across accurate models (top) and fast models (bottom) on mobile device. The green curve highlights results of NAS-FPN combined with RetinaNet. Please refer to Figure \ref{fig:performance} for details.
  • Figure 2: RetinaNet with NAS-FPN. In our proposal, the feature pyramid network is searched by a neural architecture search algorithm. The backbone model and the subnets for class and box predictions follow the original design in RetinaNet \cite{lin2018focal}. The architecture of FPN can be stacked $N$ times for better accuracy.
  • Figure 3: Four prediction steps required in a merging cell. Note the output feature layer is pushed back into the stack of candidate feature layers and available for selection for the next merging cell.
  • Figure 4: Binary operations.
  • Figure 5: Left: Rewards over RL training. The reward is computed as the AP of sampled architectures on the proxy task. Right: The number of unique sampled architectures versus the total number of sampled architectures. As the controller converges, it samples more identical architectures.
  • ...and 6 more figures
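
The merging cell described in the Figure 3 and Figure 4 captions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it tracks only which layers are combined and at which pyramid level, and it replaces the learned controller with uniform random sampling. The names `FeatureLayer` and `sample_merging_cell`, the input levels 3 to 7, and the choice of seven cells are assumptions made for the example. The four steps (two input layers, an output resolution, a binary operation) and the two binary operations (sum and global pooling) follow the paper's description.

```python
import random
from dataclasses import dataclass

# The two binary operations shown in the paper's Figure 4.
BINARY_OPS = ("sum", "global_pooling")


@dataclass(frozen=True)
class FeatureLayer:
    """Hypothetical bookkeeping record; no tensor data is stored."""
    name: str
    level: int  # pyramid level; the feature stride is 2 ** level


def sample_merging_cell(candidates, output_levels, rng, cell_index):
    """Make the four decisions of one merging cell and push the result back.

    Assumption: decisions are drawn uniformly at random here, standing in
    for the RL controller that makes them in the paper.
    """
    # Step 1: select the first input feature layer.
    i = rng.randrange(len(candidates))
    # Step 2: select a second input layer, different from the first.
    j = rng.choice([k for k in range(len(candidates)) if k != i])
    # Step 3: select the output resolution (expressed as a pyramid level).
    level = rng.choice(output_levels)
    # Step 4: select the binary operation that merges the two inputs.
    op = rng.choice(BINARY_OPS)

    name = f"cell{cell_index}:{op}({candidates[i].name},{candidates[j].name})"
    out = FeatureLayer(name=name, level=level)
    # The output joins the candidate list, so later cells may select it.
    candidates.append(out)
    return out


rng = random.Random(0)
levels = [3, 4, 5, 6, 7]  # illustrative input pyramid levels
candidates = [FeatureLayer(f"C{l}", l) for l in levels]
cells = [sample_merging_cell(candidates, levels, rng, n) for n in range(7)]
for cell in cells:
    print(cell.level, cell.name)
```

Because each output is appended to `candidates`, the set of selectable layers grows by one per cell, which is what lets a later cell reuse an earlier cell's output.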