ConsistencyDet: A Few-step Denoising Framework for Object Detection Using the Consistency Model

Lifan Jiang, Zhihui Wang, Changmiao Wang, Ming Li, Jiaxu Leng

TL;DR

Object detection is reframed as a denoising diffusion process on bounding boxes using a Consistency Model. ConsistencyDet enables few-step denoising by leveraging self-consistency, achieving faster inference than traditional diffusion-based detectors while maintaining strong accuracy. The authors present distillation-from-DiffusionDet and independent training strategies, demonstrating competitive or superior results on MS-COCO and LVIS with multiple backbones and a notable speed-accuracy trade-off. This approach offers practical benefits for real-time detection and robustness to varying proposal counts, and suggests extensions to segmentation and tracking in future work.

Abstract

Object detection, a quintessential task in the realm of perceptual computing, can be tackled using a generative methodology. In the present study, we introduce a novel framework designed to articulate object detection as a denoising diffusion process, which operates on the perturbed bounding boxes of annotated entities. This framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency feature, which empowers the model to map distorted information from any time step back to its pristine state, thereby realizing a "few-step denoising" mechanism. Such an attribute markedly elevates the operational efficiency of the model, setting it apart from the conventional Diffusion Model. Throughout the training phase, ConsistencyDet initiates the diffusion sequence with noise-infused boxes derived from the ground-truth annotations and conditions the model to perform the denoising task. Subsequently, in the inference stage, the model employs a denoising sampling strategy that commences with bounding boxes randomly sampled from a normal distribution. Through iterative refinement, the model transforms an assortment of arbitrarily generated boxes into definitive detections. Comprehensive evaluations employing standard benchmarks, such as MS-COCO and LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in performance metrics. Our code is available at https://anonymous.4open.science/r/ConsistencyDet-37D5.
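The train-with-noised-boxes, infer-from-random-boxes loop described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the multistep sampling scheme of the original Consistency Model paper (Song et al., 2023) and noise levels `EPS = 0.002` and `T_MAX = 80.0`, which are assumptions here. The function `oracle_consistency_fn` is a hypothetical stand-in for the trained network: it is given the ground-truth boxes directly, whereas a real detector would predict them from image features. The box values are made up.

```python
import numpy as np

EPS = 0.002   # assumed minimum noise level (boundary time step)
T_MAX = 80.0  # assumed maximum noise level

def add_noise(gt_boxes, t, rng):
    """Training-side perturbation: x_t = x_0 + t * z, with z ~ N(0, I)."""
    return gt_boxes + t * rng.standard_normal(gt_boxes.shape)

def oracle_consistency_fn(x_t, t, gt_boxes):
    """Hypothetical stand-in for a trained f_theta(x_t, t).

    For a toy data distribution concentrated on gt_boxes, the PF ODE
    trajectory through x_t is a straight line toward gt_boxes, so its
    point at time EPS has this closed form. A real detector would
    predict it from image features instead of being given gt_boxes.
    """
    return gt_boxes + (EPS / t) * (x_t - gt_boxes)

def few_step_sample(gt_boxes, time_steps, rng):
    """Multistep consistency sampling: denoise once from pure noise,
    then alternate re-noising to a smaller t and denoising again."""
    x = T_MAX * rng.standard_normal(gt_boxes.shape)  # random initial boxes
    boxes = oracle_consistency_fn(x, T_MAX, gt_boxes)
    for t in time_steps:  # decreasing values, each in (EPS, T_MAX)
        z = rng.standard_normal(boxes.shape)
        x_t = boxes + np.sqrt(t**2 - EPS**2) * z
        boxes = oracle_consistency_fn(x_t, t, gt_boxes)
    return boxes

rng = np.random.default_rng(0)
# Two made-up boxes in normalized (cx, cy, w, h) format.
gt = np.array([[0.2, 0.3, 0.4, 0.5], [0.6, 0.6, 0.2, 0.3]])
noisy = add_noise(gt, t=1.0, rng=rng)          # what training would see
pred = few_step_sample(gt, time_steps=[20.0, 5.0], rng=rng)
```

With an empty `time_steps` list the loop reduces to one-step denoising. Each extra entry adds one refinement pass, which is the speed-accuracy knob the few-step design exposes.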

Paper Structure (17 sections, 11 equations, 9 figures, 5 tables, 5 algorithms)


Figures (9)

  • Figure 1: Comparisons of denoising strategies of the Diffusion Model and Consistency Model for object detection. Object detection can be regarded as a denoising diffusion process from noisy boxes to object boxes. In the Diffusion Model, $q(\cdot|\cdot)$ is the diffusion process and $p_{\theta}(\cdot |\cdot)$ is the reverse process with a stepwise denoising operation. In the Consistency Model, $f_{\theta}(\cdot,\cdot)$ represents a one-step denoising process.
  • Figure 2: The Consistency Model is trained to learn a mapping that brings any point on a trajectory of the probability flow (PF) ODE back to the origin of that trajectory (Song et al., 2023).
  • Figure 3: Training procedure of the proposed ConsistencyDet. After the backbone extracts features, random Gaussian noise is added to the GT boxes following the Consistency Model's noise addition strategy. The noised boxes, together with the corresponding features produced by the RoI pooler, are then fed to ConsistencyHead, which removes the noise iteratively through several basic modules and yields the final detection results. Each basic module contains a self-attention mechanism (SA), dynamic convolutional layers (DC), and a head (HD) for classification and regression.
  • Figure 4: To preserve the self-consistency property of the Consistency Model, the noised boxes for the $(t-1)$-th and $t$-th time steps are fed into the model simultaneously. The resulting predictions are then compared with the GT to estimate the training loss.
  • Figure 5: Performance analysis of progressive refinement. ConsistencyDet is evaluated with different numbers of sampling time steps. It is trained on the MS-COCO dataset with ResNet-50 as the backbone and evaluated with 500 proposal boxes at each sampling time step. Under the same experimental parameters, the accuracy of ConsistencyDet surpasses that of DiffusionDet. All AP values are reported as percentages (%).
  • ...and 4 more figures
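
The self-consistency property referred to in the captions of Figures 2 and 4 can be written compactly. The notation below follows the general Consistency Model formulation (Song et al., 2023) and may differ from the paper's own equations. The symbols $\epsilon$ (smallest time step) and $T$ (largest time step) are assumptions introduced for illustration.

```latex
% Self-consistency: all points on one PF ODE trajectory map to the same output.
f_{\theta}(x_t, t) = f_{\theta}(x_{t'}, t')
  \qquad \text{for all } t, t' \in [\epsilon, T]

% Boundary condition: at the smallest time step the mapping is the identity.
f_{\theta}(x_{\epsilon}, \epsilon) = x_{\epsilon}
```

Read against Figure 4, the first condition is why the boxes noised at the $(t-1)$-th and $t$-th time steps are passed through the model together: they lie on the same trajectory, so the two outputs should agree.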