CatFree3D: Category-agnostic 3D Object Detection with Diffusion

Wenjing Bian, Zirui Wang, Andrea Vedaldi

TL;DR

CatFree3D introduces a diffusion-based, category-agnostic 3D object detector that decouples 3D detection from 2D detection and depth estimation. Starting from noise, a conditional diffusion process recovers the 3D bounding box parameters, conditioned on the image, a 2D box, the camera intrinsics, and the object depth. A second network estimates a confidence score $\eta = e^{-\mu}$ for each of several proposals, and the proposal with the highest $\eta$ is returned as the final prediction. For evaluation, the paper introduces the Normalised Hungarian Distance (NHD), defined as $\mathrm{NHD} = \frac{1}{d_{gt}} \sum_i \lVert a_i - b_{P(i)} \rVert_2$, where $a_i$ and $b_{P(i)}$ are matched corners of the two boxes, $P$ is the optimal one-to-one corner mapping, and $d_{gt}$ is the diagonal of the ground-truth box. NHD is scale-invariant and more informative than IoU and GIoU for thin objects, where even a small offset can drive IoU to 0. Experiments show state-of-the-art accuracy and strong cross-dataset generalisation, with practical benefits for 3D data annotation and deployment.
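The NHD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`box_corners`, `nhd`, `axis_aligned_iou`) and the box layout (centre, size along x/y/z, yaw about the y axis) are assumptions made for the example, and the optimal corner mapping $P$ is found by brute force over all 8! permutations rather than with the Hungarian algorithm, which gives the same minimum for 8 corners. The usage mirrors the thin-object case: a mirror-like box shifted slightly along its thin axis has zero IoU but a small, non-zero NHD.

```python
import itertools
import math


def box_corners(center, size, yaw=0.0):
    """Return the 8 corners of a box.

    Assumed layout (not from the paper): center = (x, y, z),
    size = extents along x, y, z, yaw = rotation about the y axis.
    """
    cx, cy, cz = center
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for fx, fy, fz in itertools.product((-0.5, 0.5), repeat=3):
        x, y, z = fx * size[0], fy * size[1], fz * size[2]
        corners.append((cx + c * x + s * z, cy + y, cz - s * x + c * z))
    return corners


def nhd(gt_corners, pred_corners):
    """NHD = (1 / d_gt) * sum_i ||a_i - b_P(i)||_2 with the best mapping P."""
    n = len(gt_corners)
    cost = [[math.dist(a, b) for b in pred_corners] for a in gt_corners]
    # Brute-force search for the optimal one-to-one corner mapping
    # (a stand-in for the Hungarian algorithm; same result for 8 corners).
    best = min(
        sum(cost[i][p[i]] for i in range(n))
        for p in itertools.permutations(range(n))
    )
    # Diagonal of the ground-truth box: largest corner-to-corner distance.
    d_gt = max(math.dist(a, b)
               for a, b in itertools.combinations(gt_corners, 2))
    return best / d_gt


def axis_aligned_iou(center_a, size_a, center_b, size_b):
    """IoU of two axis-aligned boxes, for comparison only."""
    inter = 1.0
    for ca, sa, cb, sb in zip(center_a, size_a, center_b, size_b):
        overlap = min(ca + sa / 2, cb + sb / 2) - max(ca - sa / 2, cb - sb / 2)
        if overlap <= 0:
            return 0.0
        inter *= overlap
    vol_a = size_a[0] * size_a[1] * size_a[2]
    vol_b = size_b[0] * size_b[1] * size_b[2]
    return inter / (vol_a + vol_b - inter)


# A thin, mirror-like box (2 cm thick) and a prediction shifted 5 cm in depth.
mirror_size = (1.0, 1.5, 0.02)
gt_center, pred_center = (0.0, 0.0, 2.0), (0.0, 0.0, 2.05)
gt = box_corners(gt_center, mirror_size)
pred = box_corners(pred_center, mirror_size)

print("IoU:", axis_aligned_iou(gt_center, mirror_size, pred_center, mirror_size))
print("NHD:", nhd(gt, pred))
```

Because the corner mapping is optimised rather than fixed, the value does not depend on the order in which either box lists its corners, and dividing by the ground-truth diagonal keeps it unchanged when both boxes are scaled together.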

Abstract

Image-based 3D object detection is widely employed in applications such as autonomous vehicles and robotics, yet current systems struggle with generalisation due to complex problem setup and limited training data. We introduce a novel pipeline that decouples 3D detection from 2D detection and depth prediction, using a diffusion-based approach to improve accuracy and support category-agnostic detection. Additionally, we introduce the Normalised Hungarian Distance (NHD) metric for an accurate evaluation of 3D detection results, addressing the limitations of traditional IoU and GIoU metrics. Experimental results demonstrate that our method achieves state-of-the-art accuracy and strong generalisation across various object categories and datasets.

Paper Structure

This paper contains 25 sections, 16 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: Method Overview. During forward diffusion, we add $N$ independent Gaussian noise samples to a ground-truth box $\mathbf{x}_0$ to obtain $N$ noisy boxes. We then train a denoising network $f_\theta$ to recover the target box parameters $\hat{\mathbf{x}}_0$ from the noisy boxes, conditioned on a vision-related signal $\mathbf{c}$. Additionally, we train another network $f_\phi$ to estimate a confidence score $\eta$ for each predicted box. The final output is the box with the highest confidence score.
  • Figure 2: Comparison Between NHD, IoU and GIoU. Comparing block (a) with blocks (b, c, d, e, f), we show that NHD measures errors more accurately than IoU and GIoU, particularly under translation, scaling, and rotation. Block (g) demonstrates that all three metrics are scale-invariant. Block (h) presents metric values when the two boxes are perfectly aligned.
  • Figure 3: IoU and NHD in a Practical Example. For thin objects like mirrors, even a small translational offset can lead to an IoU of 0. In contrast, NHD effectively captures and reflects the box estimation error in these cases.
  • Figure 4: Detection Performance: Results on Omni3D Test Set. 3D boxes are estimated from ground-truth (GT) 2D boxes and GT object depths.
  • Figure 5: Generalisation Performance: Results for In-the-Wild Objects on COCO Dataset. We show predictions made by our method without knowing object depths or camera intrinsics. By using constant values for depths and camera intrinsics, our approach accurately predicts 3D boxes with well-aligned projections on the image.
  • ...and 7 more figures
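
The two steps named in the Figure 1 caption, noising a ground-truth box $N$ times and picking the most confident prediction, can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the forward step uses the standard DDPM closed form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, which the captions do not spell out; the 7-value box layout is hypothetical; and the networks $f_\theta$ and $f_\phi$ are replaced by placeholders (the noisy boxes pass through unchanged and the $\mu$ values are made up). Only the selection rule $\eta = e^{-\mu}$ with the highest $\eta$ winning comes from the text.

```python
import math
import random


def forward_diffuse(x0, alpha_bar_t, num_samples, rng):
    """Draw `num_samples` independently noised copies of the box vector x0.

    Assumption: standard DDPM closed form,
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [
        [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]
        for _ in range(num_samples)
    ]


def select_most_confident(boxes, mus):
    """Return (box, eta) for the box with the highest eta = exp(-mu)."""
    etas = [math.exp(-mu) for mu in mus]
    best = max(range(len(boxes)), key=lambda i: etas[i])
    return boxes[best], etas[best]


rng = random.Random(0)

# Hypothetical box parameter layout, chosen only for this example.
x0 = [0.1, -0.2, 2.0, 1.0, 1.5, 0.02, 0.3]

# Forward diffusion: N = 4 independent noisy versions of the same box.
noisy = forward_diffuse(x0, alpha_bar_t=0.5, num_samples=4, rng=rng)

# Placeholder: a trained f_theta would map each noisy box to a prediction;
# here the noisy boxes are passed through unchanged.
predictions = noisy

# Placeholder values standing in for the confidence network f_phi.
mus = [0.9, 0.2, 1.5, 0.4]

best_box, best_eta = select_most_confident(predictions, mus)
print("selected proposal:", predictions.index(best_box), "eta:", best_eta)
```

Since $\eta = e^{-\mu}$ decreases monotonically in $\mu$, choosing the highest $\eta$ is the same as choosing the lowest $\mu$.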