Learning Spatial Fusion for Single-Shot Object Detection

Songtao Liu, Di Huang, Yunhong Wang

TL;DR

Single-shot detectors suffer from cross-scale inconsistency in pyramidal features when handling objects at multiple scales. The authors propose adaptively spatial feature fusion (ASFF), a differentiable module that learns per-location fusion weights across pyramid levels after resizing, effectively filtering conflicting information. ASFF is lightweight, backbone-agnostic, and demonstrates substantial improvements on COCO for YOLOv3 and RetinaNet baselines with minimal inference cost, supported by gradient-flow analysis and thorough ablations. The approach yields state-of-the-art speed-accuracy trade-offs among real-time detectors and generalizes to multiple architectures. Overall, ASFF offers a practical solution to enhance scale-invariance in feature pyramids for real-time object detection.

Abstract

Pyramidal feature representation is the common practice for addressing scale variation in object detection. However, inconsistency across feature scales is a primary limitation of single-shot detectors based on feature pyramids. In this work, we propose a novel, data-driven strategy for pyramidal feature fusion, referred to as adaptively spatial feature fusion (ASFF). It learns to spatially filter conflicting information to suppress the inconsistency, thus improving the scale-invariance of features, and adds almost no inference overhead. With the ASFF strategy and a solid YOLOv3 baseline, we achieve the best speed-accuracy trade-off on the MS COCO dataset, reporting 38.1% AP at 60 FPS, 42.4% AP at 45 FPS, and 43.9% AP at 29 FPS. The code is available at https://github.com/ruinmessi/ASFF

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Speed-accuracy trade-off on COCO test-dev for real-time detectors. The proposed ASFF helps YOLOv3 outperform a range of state-of-the-art algorithms.
  • Figure 2: Illustration of the adaptively spatial feature fusion mechanism. For each level, the features of all the other levels are resized to the same shape and spatially fused according to the learned weight maps.
  • Figure 3: Visualization of detection results on COCO val-2017 together with the learned scalar weight maps at each level. The level-3 heat maps inside the red box are enlarged for better visualization.
  • Figure 4: More qualitative examples when one image has several objects with different sizes.
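
The fusion step described in the Figure 2 caption can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: each pyramid level is a single-channel 2-D grid, resizing uses nearest-neighbour sampling as a stand-in for whatever resampling the paper uses, and the per-location weights are assumed to be normalized with a softmax so that they sum to one. The names `resize_nearest`, `softmax`, and `fuse_levels` are hypothetical. In the real module the weight logits would be learned; here they are passed in directly.

```python
import math


def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grid (list of rows) to out_h x out_w.

    Assumption: a simple stand-in for the resizing step, chosen for brevity.
    """
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_levels(features, weight_logits, target):
    """Fuse all pyramid levels at the resolution of features[target].

    features:      one 2-D grid per level (single channel, for illustration).
    weight_logits: one 2-D grid per level at the target resolution; these
                   stand in for the learned weight maps (assumption).
    """
    out_h, out_w = len(features[target]), len(features[target][0])
    # Step 1: bring every level to the target level's spatial shape.
    resized = [resize_nearest(f, out_h, out_w) for f in features]
    fused = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Step 2: per-location weights across levels, summing to one.
            weights = softmax([logit[i][j] for logit in weight_logits])
            # Step 3: weighted sum of the resized features at this location.
            row.append(sum(w * r[i][j] for w, r in zip(weights, resized)))
        fused.append(row)
    return fused


# Three toy levels: 4x4, 2x2, and 1x1 grids filled with constant values.
features = [
    [[1.0] * 4 for _ in range(4)],
    [[2.0] * 2 for _ in range(2)],
    [[4.0]],
]
# All-zero logits give equal weights, so fusion reduces to a plain average.
equal_logits = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
fused = fuse_levels(features, equal_logits, target=0)
```

With equal logits every fused value is the plain average (1 + 2 + 4) / 3 of the three levels. Raising the logit of one level at a given location shifts the fused value at that location toward that level's feature, which is the sense in which per-location weights can suppress conflicting information from the other levels.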