UniDet3D: Multi-dataset Indoor 3D Object Detection

Maksim Kolodiazhnyi, Anna Vorontsova, Matvey Skripkin, Danila Rukhovich, Anton Konushin

TL;DR

A simple yet effective 3D object detection model is proposed; trained on a mixture of indoor datasets, it works in various indoor environments and obtains significant gains over existing 3D object detection methods.

Abstract

Growing customer demand for smart solutions in robotics and augmented reality has attracted considerable attention to 3D object detection from point clouds. Yet existing indoor datasets, taken individually, are too small and insufficiently diverse to train a powerful and general 3D object detection model. Meanwhile, more general approaches utilizing foundation models are still inferior in quality to those based on supervised training for a specific task. In this work, we propose UniDet3D, a simple yet effective 3D object detection model that is trained on a mixture of indoor datasets and is capable of working in various indoor environments. By unifying different label spaces, UniDet3D enables learning a strong representation across multiple datasets through a supervised joint training scheme. The proposed network architecture is built upon a vanilla transformer encoder, making it easy to run, customize, and extend the prediction pipeline for practical use. Extensive experiments demonstrate that UniDet3D obtains significant gains over existing 3D object detection methods on 6 indoor benchmarks: ScanNet (+1.1 mAP50), ARKitScenes (+19.4 mAP25), S3DIS (+9.1 mAP50), MultiScan (+9.3 mAP50), 3RScan (+3.2 mAP50), and ScanNet++ (+2.7 mAP50). Code is available at https://github.com/filapro/unidet3d.
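To make the idea of unifying label spaces concrete, the following is a minimal sketch of how per-dataset class lists could be merged into one de-duplicated label set with index maps. It is an illustration under assumptions, not the authors' implementation: the function name `build_unified_label_space`, the name normalisation rule, and the toy label lists are all hypothetical.

```python
def normalise(name):
    """Assumed normalisation: trim whitespace and lowercase the class name."""
    return name.strip().lower()


def build_unified_label_space(dataset_labels):
    """Merge per-dataset class names into one de-duplicated, sorted list.

    dataset_labels: dict mapping a dataset name to its list of class names.
    Returns the unified class list and, for each dataset, a list that maps
    the dataset-local class index to the unified class index.
    """
    unified = sorted(
        {normalise(n) for names in dataset_labels.values() for n in names}
    )
    index_of = {name: i for i, name in enumerate(unified)}
    local_to_unified = {
        dataset: [index_of[normalise(n)] for n in names]
        for dataset, names in dataset_labels.items()
    }
    return unified, local_to_unified


# Hypothetical toy label lists, not the real dataset taxonomies.
toy_labels = {
    "dataset_a": ["chair", "table", "bed"],
    "dataset_b": ["Table", "sofa", "chair"],
}
unified, mapping = build_unified_label_space(toy_labels)
```

With such a mapping, ground-truth labels from every dataset can be rewritten into the shared index space, so a single classification head can be trained on all datasets jointly.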

Paper Structure (46 sections, 7 equations, 4 figures, 9 tables)

This paper contains 46 sections, 7 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Existing 3D object detection methods use different architectures and weights to achieve state-of-the-art metrics on different datasets. We propose UniDet3D, which is trained a single time on a mixture of datasets and achieves even better results.
  • Figure 2: Three common ways of handling heterogeneous label spaces for training. The partitioned scheme implies using a separate classification head for each dataset. UniDet3D follows the unified scheme, using the same de-duplicated set of labels during both training and inference.
  • Figure 3: Overview of the proposed method. UniDet3D takes a point cloud as input and extracts point features using a sparse 3D U-Net network. Point features are averaged across superpoints in the superpoint pooling step. The aggregated features serve as input queries to a vanilla transformer encoder. Finally, 3D bounding boxes are derived from the encoder outputs with a box MLP and a class MLP: the box MLP estimates the location of a 3D bounding box w.r.t. the mass center of the superpoint, and the class MLP outputs probabilities of object classes in the unified label space.
  • Figure 4: Comparison with existing transformer-based 3D object detection methods. We introduce an encoder-only transformer architecture without positional encoding for queries or attention layers. This allows us to replace unstable Hungarian matching with a simpler disentangled scheme.
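
The superpoint pooling step in the Figure 3 caption averages point features within each superpoint. A minimal pure-Python sketch of that averaging is given below; it assumes features arrive as plain lists of floats and superpoint assignments as integer ids, and the function name `superpoint_pool` is hypothetical rather than taken from the released code.

```python
def superpoint_pool(point_features, superpoint_ids):
    """Average per-point feature vectors within each superpoint.

    point_features: list of equal-length lists of floats, one per point.
    superpoint_ids: list of ints giving the superpoint index of each point.
    Returns a dict {superpoint_id: mean feature vector}.
    """
    sums, counts = {}, {}
    for feat, sp in zip(point_features, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
            counts[sp] = 0
        # Accumulate the feature vector element-wise for this superpoint.
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    # Divide each accumulated vector by the number of points it covers.
    return {sp: [s / counts[sp] for s in sums[sp]] for sp in sums}


# Toy example: three points with 2-D features, grouped into two superpoints.
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
ids = [0, 0, 1]
queries = superpoint_pool(features, ids)
```

Each pooled vector then plays the role of one input query to the transformer encoder, so the number of queries equals the number of superpoints rather than the number of points.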