DSSD : Deconvolutional Single Shot Detector

Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg

TL;DR

This paper presents DSSD, an extension of the SSD object detector that adds large-scale context through a deconvolution-based encoder–decoder (hourglass) structure built on a Residual-101 backbone. It introduces a deconvolution module for the feed-forward connections and a new prediction (output) module, which together make end-to-end training succeed where a naive implementation does not. With $513 \times 513$ input, DSSD reaches 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming R-FCN on each dataset, with accuracy improving especially for small objects while speed stays competitive.

Abstract

The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this, we first combine a state-of-the-art classifier (Residual-101[14]) with a fast detection framework (SSD[18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead, we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with $513 \times 513$ input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method, R-FCN[3], on each dataset.
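
The abstract's idea of using deconvolution to feed coarse, large-scale context back into a finer feature map can be illustrated with a toy sketch. This is not the authors' module: it uses a single channel, fixed rather than learned weights, and no normalization layers. The element-wise product used for fusion is an assumption, since this page does not describe the module's internals. All names and values are hypothetical.

```python
def deconv2x2_stride2(feature, kernel):
    """Transposed convolution with a 2x2 kernel and stride 2 (single channel).

    Each input value is scattered into its own 2x2 output block, so an
    H x W map becomes 2H x 2W with no overlap between blocks.
    """
    h, w = len(feature), len(feature[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for di in range(2):
                for dj in range(2):
                    out[2 * i + di][2 * j + dj] += feature[i][j] * kernel[di][dj]
    return out


def fuse(deep, shallow, kernel):
    """Upsample the deep map and combine it element-wise with the shallow map."""
    up = deconv2x2_stride2(deep, kernel)
    assert len(up) == len(shallow) and len(up[0]) == len(shallow[0])
    # Element-wise product followed by ReLU; the product is an assumption here.
    return [[max(0.0, u * s) for u, s in zip(up_row, sh_row)]
            for up_row, sh_row in zip(up, shallow)]


deep = [[1.0, 2.0], [3.0, 4.0]]           # coarse 2x2 map (hypothetical values)
shallow = [[1.0] * 4 for _ in range(4)]   # finer 4x4 map (hypothetical values)
kernel = [[1.0, 1.0], [1.0, 1.0]]         # stands in for learned weights
fused = fuse(deep, shallow, kernel)       # 4x4 map carrying the coarse context
```

In this sketch every position of the finer map is modulated by the coarse value that covers it, which is one simple way a small object's features could pick up surrounding context.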

Paper Structure

This paper contains 8 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Networks of SSD and DSSD on a residual network. The blue modules are the layers added in the SSD framework, and we call them SSD layers. In the bottom figure, the red layers are DSSD layers.
  • Figure 2: Variants of the prediction module
  • Figure 3: Deconvolution module
  • Figure 4 (a, b): Detection examples on COCO test-dev with the SSD321/DSSD321 models. For each pair, the left side is the result of SSD and the right side is the result of DSSD. We show detections with scores higher than 0.6. Each color corresponds to an object category.