Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, Yanzhao Zhou

TL;DR

Ray Denoising enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples compel the model to learn depth-aware features, improving its capacity to distinguish between true and false positives.

Abstract

Multi-view 3D object detection systems often struggle to generate precise predictions because depth is difficult to estimate from images, which leads to redundant and incorrect detections. Our paper presents Ray Denoising, a method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples, visually challenging to differentiate from true positives, compel the model to learn depth-aware features, thereby improving its capacity to distinguish between true and false positives. Ray Denoising is designed as a plug-and-play module compatible with any DETR-style multi-view 3D detector, and it only minimally increases training computational cost without affecting inference speed. Our comprehensive experiments, including detailed ablation studies, consistently demonstrate that Ray Denoising outperforms strong baselines across multiple datasets. It achieves a 1.9% improvement in mean Average Precision (mAP) over the state-of-the-art StreamPETR method on the NuScenes dataset and shows significant performance gains on the Argoverse 2 dataset, highlighting its generalization capability. The code will be available at https://github.com/LiewFeng/RayDN.
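The core idea of sampling along a camera ray can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_ray_queries`, the `max_offset` parameter, and the uniform depth offset are assumptions made for illustration only. The sketch perturbs a ground-truth object center in depth only, so every sampled point projects to the same image location as the true object.

```python
import math
import random


def sample_ray_queries(cam_origin, gt_center, num_queries=3, max_offset=2.0, rng=None):
    """Return points on the camera ray through gt_center, perturbed in depth only.

    Hypothetical helper: the uniform offset is an illustrative assumption,
    not the sampling distribution used in the paper.
    """
    rng = rng or random.Random(0)
    # Direction from the camera to the object center, and the depth along that ray.
    direction = [g - c for g, c in zip(gt_center, cam_origin)]
    depth = math.sqrt(sum(d * d for d in direction))
    unit = [d / depth for d in direction]

    queries = []
    for _ in range(num_queries):
        # Shift the point forward or backward along the ray; keep depth positive.
        offset = rng.uniform(-max_offset, max_offset)
        new_depth = max(depth + offset, 1e-3)
        queries.append([c + new_depth * u for c, u in zip(cam_origin, unit)])
    return queries


if __name__ == "__main__":
    for point in sample_ray_queries([0.0, 0.0, 0.0], [10.0, 0.0, 0.0]):
        print(point)
```

Because the sampled points differ from the true center only in depth, they look nearly identical to the true object in the image, which is what makes them hard negatives.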

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The challenge of estimating depth from images in multi-view 3D object detection leads to duplicate predictions and false positive detections along camera rays. Best viewed in color.
  • Figure 2: The proposed Ray Denoising approach (right) effectively reduces false positive detections along the ray (highlighted by red rectangles) compared with the previous state-of-the-art method StreamPETR (Wang et al., ICCV 2023) (left). Best viewed by zooming on the screen.
  • Figure 3: Overall framework of Ray Denoising, a plug-and-play training technique for DETR-style multi-view 3D object detectors that refines the model's ability to distinguish true positives from false positives in depth. Casting rays and sampling depth-aware denoising queries tackles the false positives that arise from the inherent difficulty of visually estimating depth, leading to substantial improvements in detection performance over strong baselines. Best viewed in color and by zooming on the screen.
  • Figure 4: (a) Distribution comparison showing that the Beta distribution is bounded between -1 and 1, unlike the Laplace and Gaussian distributions, which are unbounded. (b) The Beta distribution family, with the x-range adjusted from $[0,1]$ to $[-1,1]$ using the transformation $y=2x-1$. Best viewed in color.
  • Figure 5: (a) Visualization of the precision-recall curves at various distance thresholds. Ray Denoising consistently enhances precision across nearly all recall levels, effectively suppressing false positives. (b) Class-wise AP comparison. Ray Denoising outperforms the SOTA StreamPETR in all object classes. Best viewed in color.
  • ...and 1 more figure
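
The rescaling described in the Figure 4 caption, $y=2x-1$, can be sketched with Python's standard library. This is an illustrative example only: the function name `sample_scaled_beta` and the shape parameters used below are assumptions, not values taken from the paper. It shows that Beta samples mapped from $[0,1]$ to $[-1,1]$ stay bounded, whereas Gaussian samples do not.

```python
import random


def sample_scaled_beta(alpha, beta, n, seed=0):
    """Draw n Beta(alpha, beta) samples and map them from [0, 1] to [-1, 1]."""
    rng = random.Random(seed)
    # random.betavariate returns x in [0, 1]; y = 2x - 1 lies in [-1, 1].
    return [2.0 * rng.betavariate(alpha, beta) - 1.0 for _ in range(n)]


if __name__ == "__main__":
    # Shape parameters chosen arbitrarily for illustration.
    beta_samples = sample_scaled_beta(2.0, 2.0, 1000)
    rng = random.Random(0)
    gauss_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]

    print("Beta samples outside [-1, 1]:", sum(abs(y) > 1.0 for y in beta_samples))
    print("Gaussian samples outside [-1, 1]:", sum(abs(y) > 1.0 for y in gauss_samples))
```

A bounded distribution guarantees that a sampled offset never exceeds a chosen maximum, which an unbounded Laplace or Gaussian distribution cannot do without clipping.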