Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection

Jiacheng Deng, Jiahao Lu, Tianzhu Zhang

TL;DR

Diff3DETR tackles semi-supervised 3D object detection by integrating diffusion-based pseudo-label generation into a DETR framework under a mean-teacher setup. It introduces an agent-based object query generator to balance sampling locations and content embeddings, and a box-aware denoising module that leverages DDIM denoising and long-range transformer attention to progressively refine noisy boxes. The approach yields diverse and high-quality pseudo-labels and improved bounding box accuracy, outperforming state-of-the-art methods on ScanNet and SUN RGB-D with limited labeled data. This work demonstrates that diffusion-based DETR architectures can effectively leverage unlabeled 3D data for robust scene understanding with practical impact on indoor robotics and AR/VR.
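
The mean-teacher setup mentioned above keeps the teacher as a slowly moving average of the student. The snippet below is a minimal sketch of that update, not the authors' implementation: the function name `ema_update`, the dictionary-of-floats weight format, and the `decay` value are all assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are hypothetical {name: float} weight dictionaries;
    a real model would hold tensors instead of floats.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy weights, chosen only to make the arithmetic easy to follow.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, which is the usual reason a mean-teacher model gives steadier pseudo-labels than the student it tracks.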

Abstract

3D object detection is essential for understanding 3D scenes. Contemporary techniques often require extensive annotated training data, yet obtaining point-wise annotations for point clouds is time-consuming and laborious. Recent developments in semi-supervised methods seek to mitigate this problem by employing a teacher-student framework to generate pseudo-labels for unlabeled point clouds. However, these pseudo-labels frequently suffer from insufficient diversity and inferior quality. To overcome these hurdles, we introduce an Agent-based Diffusion Model for Semi-supervised 3D Object Detection (Diff3DETR). Specifically, an agent-based object query generator is designed to produce object queries that effectively adapt to dynamic scenes while striking a balance between sampling locations and content embedding. Additionally, a box-aware denoising module utilizes the DDIM denoising process and the long-range attention in the transformer decoder to refine bounding boxes incrementally. Extensive experiments on ScanNet and SUN RGB-D datasets demonstrate that Diff3DETR outperforms state-of-the-art semi-supervised 3D object detection methods.
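
To make the incremental refinement idea concrete, here is a hedged sketch of the standard deterministic DDIM update (eta = 0) applied to a box vector. It is not taken from the paper: the six-value box layout, the noise levels, the fixed noise values, and the use of the ground-truth box as a stand-in for the detector's prediction are all assumptions.

```python
import math


def q_sample(x0, alpha_bar_t, noise):
    """Forward diffusion: blend a clean box with noise at level alpha_bar_t."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * n for x, n in zip(x0, noise)]


def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from level t to a less noisy level.

    Requires alpha_bar_t < 1 so the implied noise is well defined.
    """
    a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, s_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    # Noise implied by the current sample and the predicted clean box.
    eps = [(xt - a_t * x0) / s_t for xt, x0 in zip(x_t, x0_pred)]
    # Re-noise the prediction at the lower noise level with the same noise.
    return [a_p * x0 + s_p * e for x0, e in zip(x0_pred, eps)]


# Hypothetical box layout: center (cx, cy, cz) followed by size (dx, dy, dz).
gt_box = [0.5, 0.4, 0.3, 0.2, 0.6, 0.1]
# Fixed values standing in for Gaussian samples, to keep the example repeatable.
noise = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9]
noisy_box = q_sample(gt_box, 0.5, noise)

# Stand-in for the detector: an oracle that predicts the clean box exactly.
less_noisy = ddim_step(noisy_box, gt_box, 0.5, 0.8)
refined = ddim_step(less_noisy, gt_box, 0.8, 1.0)
```

In a real detector the clean-box prediction would come from the decoder at every step, so each DDIM step can correct the previous estimate rather than simply undo known noise.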

Paper Structure (14 sections, 8 equations, 7 figures, 5 tables, 2 algorithms)

Figures (7)

  • Figure 1: (a) presents three candidate generation modes: Farthest Point Sampling (FPS), learnable object query, and ours. Our candidate generation mode simultaneously considers the distribution of sampling locations and the learning of content information. (b) displays the geometric differences between initial boxes (in red) and ground truth boxes (in green) in two scenes, highlighting the importance of aggregating features from the correct areas for 3D object detection.
  • Figure 2: The framework of Diff3DETR. Diff3DETR adopts the mean teacher framework (Tarvainen and Valpola, 2017), consisting of a student model and a teacher model. The student and teacher models start from GT/pseudo boxes and Gaussian noise, respectively, gradually adding noise to generate noisy boxes and ultimately predicting the accurate object boxes through the Diff3DETR detector. The student model updates its parameters under the supervision of ground truths and pseudo-labels, while the teacher model updates its parameters using an Exponential Moving Average (EMA) strategy.
  • Figure 3: The overall architecture of the Diff3DETR detector. Input point clouds undergo downsampling and feature extraction, with FPS selecting noisy centers. The agent-based object query generator sets learnable agents interacting with scene features and generates object queries through trilinear interpolation with these centers. Concurrently, noisy boxes initialized with Gaussian noise and object queries are processed in the box-aware denoising module. This module updates queries and predicts boxes, aided by the DDIM layer for iterative denoising.
  • Figure 4: The architecture of the decoder layer.
  • Figure 5: The qualitative results on ScanNet and SUN RGB-D datasets.
  • ...and 2 more figures
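
The captions for Figures 1 and 3 refer to Farthest Point Sampling (FPS) for choosing candidate centers. The following is a minimal pure-Python sketch of generic greedy FPS, given only to illustrate the idea; the function names, the `start` parameter, and the toy points are assumptions, and the paper's own sampling code may differ.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices, each farthest from those already chosen.

    Assumes 1 <= k <= len(points). `start` is the index of the first pick.
    """
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_d2 = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        # The next center is the point farthest from the current selection.
        idx = max(range(len(points)), key=min_d2.__getitem__)
        selected.append(idx)
        for i, p in enumerate(points):
            d2 = squared_distance(p, points[idx])
            if d2 < min_d2[i]:
                min_d2[i] = d2
    return selected


# Toy point cloud on a line, chosen so the picks are easy to verify by hand.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0),
         (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
centers = farthest_point_sampling(cloud, k=3)
```

FPS spreads the chosen centers across the scene but carries no learned content, which is the gap Figure 1(a) contrasts with learnable object queries.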