DiffusionDet: Diffusion Model for Object Detection

Shoufa Chen; Peize Sun; Yibing Song; Ping Luo

Abstract

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During training, object boxes diffuse from ground-truth boxes to a random distribution, and the model learns to reverse this noising process. At inference, the model progressively refines a set of randomly generated boxes into the output results. This formulation is flexible: it supports a dynamic number of boxes and iterative evaluation. Extensive experiments on standard benchmarks show that DiffusionDet achieves favorable performance compared to previous well-established detectors. For example, DiffusionDet achieves 5.3 AP and 4.8 AP gains when evaluated with more boxes and iteration steps, under a zero-shot transfer setting from COCO to CrowdHuman. Our code is available at https://github.com/ShoufaChen/DiffusionDet.
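
The training-time corruption described in the abstract (ground-truth boxes diffusing toward a random distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard Gaussian forward process z_t ~ N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I), a cosine noise schedule, and boxes stored as normalized (cx, cy, w, h). The names `cosine_alpha_bar` and `corrupt_boxes` are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t under a cosine schedule (assumed here)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def corrupt_boxes(boxes, t, num_steps, rng):
    """Sample noisy boxes z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, with a = alpha_bar_t."""
    a = cosine_alpha_bar(t, num_steps)
    return [
        [math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for c in box]
        for box in boxes
    ]


rng = random.Random(0)
# Two illustrative ground-truth boxes in normalized (cx, cy, w, h) format.
gt_boxes = [[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]]

slightly_noisy = corrupt_boxes(gt_boxes, t=1, num_steps=1000, rng=rng)
mostly_noise = corrupt_boxes(gt_boxes, t=999, num_steps=1000, rng=rng)
```

At small `t` the sampled boxes stay close to the ground truth; near `t = num_steps` almost no signal remains, which matches the random boxes that inference starts from.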

Paper Structure (47 sections, 10 equations, 6 figures, 11 tables, 2 algorithms)

Introduction
Related Work
  Object detection.
  Diffusion model.
  Diffusion model for perception tasks.
Approach
  Preliminaries
    Object detection.
    Diffusion model.
  Architecture
    Image encoder.
    Detection decoder.
  Training
    Ground truth boxes padding.
    Box corruption.
...and 32 more sections

Figures (6)

Figure 1: Diffusion model for object detection. (a) A diffusion model where $q$ is the diffusion process and $p_\theta$ is the reverse process. (b) Diffusion model for image generation task. (c) We propose to formulate object detection as a denoising diffusion process from noisy boxes to object boxes.
Figure 2: DiffusionDet framework. (a) The image encoder extracts feature representation from an input image. The detection decoder takes noisy boxes as input and predicts category classification and box coordinates. (b) The detection decoder has 6 stages in one detection head, following DETR and Sparse R-CNN. Besides, DiffusionDet can reuse this detection head (with 6 stages) multiple times, which is called "iterative evaluation".
Figure 3: Flexibility of DiffusionDet. All models are trained on the COCO 2017 train set and evaluated on the COCO 2017 val set. DiffusionDet uses the same network parameters for all settings in both subfigures (varying the number of evaluation boxes and the number of sampling steps). With those fixed parameters, DiffusionDet benefits from more proposal boxes and more iteration steps.
Figure 4: Statistical results over 5 independent training instances, each evaluated 10 times with different random seeds.
Figure 5: Dynamic number of boxes. All models are trained with 300 candidates (i.e., learnable queries or random boxes). When $N_{train} > N_{eval}$, we directly choose $N_{eval}$ from $N_{train}$ candidates; when $N_{train} < N_{eval}$, we design two strategies, i.e., clone and concat random.
...and 1 more figures
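
The two strategies named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the caption does not say how the $N_{eval}$ candidates are chosen when $N_{train} > N_{eval}$, so the sketch simply keeps the first ones; "clone" is taken to mean duplicating existing candidates, and "concat random" to mean appending Gaussian-initialized ones. The function name `resize_candidates` is hypothetical.

```python
import random


def resize_candidates(candidates, n_eval, strategy, rng):
    """Return n_eval candidates from a trained set of len(candidates) entries."""
    n_train = len(candidates)
    if n_eval <= n_train:
        # Assumption: keep the first n_eval candidates.
        return candidates[:n_eval]
    extra = n_eval - n_train
    if strategy == "clone":
        # Duplicate randomly chosen existing candidates.
        return candidates + [list(rng.choice(candidates)) for _ in range(extra)]
    if strategy == "concat_random":
        # Append freshly sampled Gaussian candidates of the same dimension.
        dim = len(candidates[0])
        return candidates + [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(extra)]
    raise ValueError(f"unknown strategy: {strategy}")


rng = random.Random(0)
trained = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.2, 0.2, 0.1, 0.1]]

fewer = resize_candidates(trained, 2, "clone", rng)
cloned = resize_candidates(trained, 5, "clone", rng)
padded = resize_candidates(trained, 5, "concat_random", rng)
```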
