FSSD: Feature Fusion Single Shot Multibox Detector

Zuoxin Li, Lu Yang, Fuqiang Zhou

TL;DR

FSSD tackles the SSD limitation in cross-scale feature fusion by introducing a lightweight feature fusion module that concatenates multi-layer ConvNet features, normalizes them, and creates a new pyramid for single-shot detection. The approach yields clear accuracy gains, especially for small objects, with only a modest slowdown, evidenced by VOC and COCO results that outperform SSD across both 300×300 and 512×512 configurations. On VOC07+12, FSSD300 reaches 78.8% mAP (82.7% with COCO pretraining), and VOC2012 results exceed SSD benchmarks (e.g., 82.0% vs 79.3% for 300×300). COCO test-dev shows notable improvements as well (27.1% AP for 300 and 31.8% for 512), while maintaining real-time inference speeds (65.8 FPS for 300×300 on a 1080Ti). The work suggests that stronger backbones and integration with other detection frameworks could further enhance performance and applicability.

Abstract

SSD (Single Shot Multibox Detector) is one of the best object detection algorithms, offering both high accuracy and fast speed. However, SSD's feature pyramid detection method makes it hard to fuse features from different scales. In this paper, we propose FSSD (Feature Fusion Single Shot Multibox Detector), an enhanced SSD with a novel and lightweight feature fusion module that improves performance significantly over SSD with only a small drop in speed. In the feature fusion module, features from different layers with different scales are concatenated together, followed by several down-sampling blocks that generate a new feature pyramid, which is fed to multibox detectors to predict the final detection results. On the Pascal VOC 2007 test set, our network achieves 82.7 mAP (mean average precision) at 65.8 FPS (frames per second) with an input size of 300×300 on a single Nvidia 1080Ti GPU. In addition, our result on COCO is better than the conventional SSD by a large margin. Our FSSD outperforms many state-of-the-art object detection algorithms in both accuracy and speed. Code is available at https://github.com/lzx1413/CAFFE_SSD/tree/fssd.
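The concatenate-then-downsample idea described in the abstract can be illustrated with a small shape-bookkeeping sketch. It tracks only channel counts and spatial sizes and performs no real convolution. The source-layer names, channel counts, the 256-channel projection, and the 3×3 stride-2 down-sampling blocks are illustrative assumptions rather than the authors' configuration; the released Caffe code is the reference for the actual layers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureMap:
    name: str
    channels: int
    size: int  # spatial side length; maps are assumed square


def fuse(maps: List[FeatureMap], reduced_channels: int) -> FeatureMap:
    """Resize all maps to the largest resolution and concatenate on channels.

    Assumption: each source map is first projected to `reduced_channels`,
    so the fused map has reduced_channels * len(maps) channels.
    """
    target_size = max(m.size for m in maps)
    return FeatureMap("fused", reduced_channels * len(maps), target_size)


def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1


def build_pyramid(fused: FeatureMap, levels: int, out_channels: int) -> List[FeatureMap]:
    """Generate a new pyramid from the fused map.

    Assumption: level 0 keeps the fused resolution (stride-1 conv), and each
    later level applies one stride-2 down-sampling block to the previous one.
    """
    pyramid = [FeatureMap("p0", out_channels, conv_out(fused.size, stride=1))]
    for i in range(1, levels):
        pyramid.append(FeatureMap(f"p{i}", out_channels, conv_out(pyramid[-1].size)))
    return pyramid


if __name__ == "__main__":
    # Hypothetical source layers for a 300x300 input (values are illustrative).
    sources = [
        FeatureMap("layer_a", 512, 38),
        FeatureMap("layer_b", 1024, 19),
        FeatureMap("layer_c", 512, 10),
    ]
    fused = fuse(sources, reduced_channels=256)
    print(fused)
    for level in build_pyramid(fused, levels=6, out_channels=512):
        print(level)
```

Each map in the resulting pyramid is derived from the same fused tensor, which is how a single fusion step can supply cross-scale context to every detection scale.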

Paper Structure

This paper contains 25 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) Features are computed independently from images at different scales, which is inefficient. (b) Features from only one scale are used to detect objects, as in some two-stage detectors such as Faster R-CNN and R-FCN. (c) The feature fusion method adopted by FPN and SharpMask: features are fused from top to bottom, layer by layer. (d) The feature pyramid generated by a ConvNet is used directly; the conventional SSD is one example. (e) Our proposed feature fusion and feature pyramid generation method. Features from different layers with different scales are first concatenated and then used to generate a series of pyramid features.
  • Figure 2: (a) The SSD framework as proposed in the SSD paper; (b) our FSSD framework.
  • Figure 3: Pyramid feature generators for FSSD300. We use the feature maps from the gray blobs to detect objects. In (a), the fusion feature map takes part in object detection. In (b), we detect objects only on the feature maps after the fusion feature map. In (c), we replace the simple Conv+ReLU group with a bottleneck block consisting of two Conv+ReLU layers.
  • Figure 4: Training process comparison. The vertical axis denotes the mAP calculated on the VOC2007 test set and the horizontal axis represents the training iterations. SSD denotes the conventional SSD model trained with the default settings from a pre-trained VGG16 model. FSSD denotes the FSSD model trained from a pre-trained VGG model, with the same training parameters as SSD. FSSD+ denotes the FSSD model trained from a pre-trained SSD model; FSSD+ is optimized for only 60k iterations. All models are trained on the VOC07+12 dataset.
  • Figure 5: Speed and accuracy distribution of different object detection algorithms. The speeds are all measured on a single Titan X GPU. As we do not have a Titan X GPU, the speed of our FSSDs is estimated by comparison with SSD's speed, which we measured on our own Nvidia 1080Ti.
  • ...and 1 more figures