MVSDet: Multi-View Indoor 3D Object Detection via Efficient Plane Sweeps

Yating Xu, Chen Li, Gim Hee Lee

TL;DR

This paper designs a probabilistic sampling and soft weighting mechanism to decide the placement of pixel features on the 3D volume and applies recent pixel-aligned Gaussian Splatting to regularize depth prediction and improve detection performance with little computation overhead.

Abstract

The key challenge of multi-view indoor 3D object detection is to infer accurate geometry information from images for precise 3D detection. Previous work relies on NeRF for geometry reasoning. However, the geometry extracted from NeRF is generally inaccurate, which leads to sub-optimal detection performance. In this paper, we propose MVSDet, which uses plane sweep for geometry-aware 3D object detection. To avoid needing a large number of depth planes for accurate depth prediction, we design a probabilistic sampling and soft weighting mechanism to decide the placement of pixel features on the 3D volume. For each pixel, we select the top-scoring locations in the probability volume and use their probability scores to indicate confidence. We further apply recent pixel-aligned Gaussian Splatting to regularize depth prediction and improve detection performance with little computation overhead. Extensive experiments on the ScanNet and ARKitScenes datasets show the superiority of our model. Our code is available at https://github.com/Pixie8888/MVSDet.
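The probabilistic sampling and soft weighting idea can be illustrated for a single pixel ray with a minimal sketch. This is not the authors' implementation: the function names, the softmax over matching scores, and the `tolerance` rule for matching a voxel to a depth proposal are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw per-plane matching scores for one pixel into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_depth_proposals(probs, depth_planes, k=3):
    """Return the k most probable (depth, probability) pairs for one pixel."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(depth_planes[i], probs[i]) for i in ranked[:k]]


def soft_weight(voxel_depth, proposals, tolerance):
    """Confidence for placing the pixel feature at a voxel along the ray.

    Assumption: a voxel is a valid placement if its depth lies within
    `tolerance` of some proposal; its weight is that proposal's probability.
    Voxels that match no proposal get weight 0 and receive no feature.
    """
    matches = [p for d, p in proposals if abs(d - voxel_depth) <= tolerance]
    return max(matches, default=0.0)


# Hypothetical example: 4 depth planes (in metres) and made-up matching scores.
depth_planes = [1.0, 2.0, 3.0, 4.0]
probs = softmax([0.0, 2.0, 1.0, -1.0])
proposals = top_k_depth_proposals(probs, depth_planes, k=2)

# A voxel near the most likely depth receives a weighted feature;
# a voxel far from every proposal is suppressed.
near = soft_weight(2.1, proposals, tolerance=0.25)
far = soft_weight(4.0, proposals, tolerance=0.25)
```

Keeping several proposals per pixel, rather than only the single most likely depth, lets a coarse set of depth planes still cover the true surface, while the probability score down-weights less certain placements.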

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison with NeRF-Det [xu2023nerf]. The 3D voxel centers (grey dots) are overlaid on the reference scene. The red dots denote pixel features erroneously backprojected to points in free space. Compared to NeRF-Det, our method produces far fewer inaccurate backprojections.
  • Figure 2: Overview of our MVSDet. The upper branch shows the detection pipeline with our proposed probabilistic sampling and soft weighting. The backprojected ray intersects 3 points (shown as dots), but only the green point receives the pixel feature, based on the selected depth proposals. The red points mark invalid backprojection locations, so the pixel feature is not assigned to them. "GT Location" is the ground truth 3D location of the pixel. The lower branch shows the pixel-aligned Gaussian Splatting (PAGS). For each novel image, we select nearby views from the images input to the detection branch and predict Gaussian maps on them. Note that PAGS is removed during testing.
  • Figure 3: Comparison of different feature backprojection methods. The pixel ray intersects 4 voxel centers, with the blue box denoting the ground truth 3D location of the pixel. Our method computes the placement of the pixel features based on the depth probability distribution (purple) and is thus able to suppress incorrect intersections.
  • Figure 4: Qualitative comparison on the ScanNet dataset. Note that the mesh is not an input to the model and is shown for visualization purposes only.
  • Figure 5: Depth map visualization. "GT Depth" denotes ground truth depth map. Both "w/ Gaussian" and "w/o Gaussian" use $M=12$ depth planes.
  • ...and 3 more figures