MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

Zichao Dong, Yilin Zhang, Xufeng Huang, Hang Ji, Zhan Shi, Xin Zhan, Junbo Chen

TL;DR

The paper tackles indoor 3D object detection from RGB-D data, where depth-derived geometry and RGB-derived texture offer complementary cues but texture pretraining is underutilized. It introduces MV-DETR, which uses separate geometry and visual texture encoders, a simple VG Connector for fusion, and a DETR-like decoder augmented with 3DV-RPE to focus attention on regions near predicted 3D bounding boxes. The method leverages strong RGB pretraining for texture and a lightweight fusion strategy to achieve state-of-the-art results on ScanNetv2, reaching 78.0 AP. This approach yields an efficient, accurate indoor perception pipeline suitable for embodied AI and downstream tasks in robotics and navigation.

Abstract

We introduce MV-DETR, a novel transformer-based detection pipeline that is both effective and efficient. Given RGB-D input, we observe that very strong pretrained weights exist for RGB data, whereas pretraining for depth-related data is far less effective. First, we argue that geometry and texture cues are both vital but can be encoded separately. Second, we find that visual texture features are harder to extract than geometry features in 3D space. Unfortunately, a single RGB-D dataset with only thousands of samples is not enough to train a discriminative filter for visual texture feature extraction. Finally, we design a lightweight VG module consisting of a visual texture encoder, a geometry encoder, and a VG connector. Compared with previous state-of-the-art methods such as V-DETR, the gains from the pretrained visual encoder are evident. Extensive experiments on the ScanNetV2 dataset show the effectiveness of our method. Notably, our method achieves 78% AP, setting a new state of the art on the ScanNetV2 benchmark.

Paper Structure (22 sections, 1 equation, 2 figures, 1 table)

Figures (2)

  • Figure 1: Pipeline of MV-DETR. MV-DETR consists of four main components: a geometry encoder, a visual texture encoder, a connector, and a detection decoder.
  • Figure 2: Components of the VG Connector. We use only a simple linear layer as an adaptor to fuse features from multiple domains.
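
The Figure 2 caption describes fusion through a simple linear adaptor. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes that geometry and texture features have already been aligned token by token, and that fusion is a sum of two linear projections; this is equivalent to a single linear layer applied to the concatenated features. The class name `LinearConnector`, the feature dimensions, and the random weights standing in for learned parameters are all hypothetical.

```python
import numpy as np


class LinearConnector:
    """Hypothetical connector sketch: one linear projection per modality, then a sum."""

    def __init__(self, geo_dim, tex_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for parameters that would be learned in training.
        self.w_geo = rng.standard_normal((geo_dim, out_dim)) / np.sqrt(geo_dim)
        self.w_tex = rng.standard_normal((tex_dim, out_dim)) / np.sqrt(tex_dim)
        self.bias = np.zeros(out_dim)

    def __call__(self, geo_feats, tex_feats):
        # geo_feats: (N, geo_dim); tex_feats: (N, tex_dim).
        # Assumption: row i of both arrays describes the same 3D token.
        if geo_feats.shape[0] != tex_feats.shape[0]:
            raise ValueError("geometry and texture features must have the same number of tokens")
        return geo_feats @ self.w_geo + tex_feats @ self.w_tex + self.bias


# Usage with made-up sizes: 1024 tokens, 256-d geometry and 768-d texture features.
connector = LinearConnector(geo_dim=256, tex_dim=768, out_dim=256)
geo = np.random.default_rng(1).standard_normal((1024, 256))
tex = np.random.default_rng(2).standard_normal((1024, 768))
fused = connector(geo, tex)
print(fused.shape)  # (1024, 256)
```

Keeping the connector to a single linear map leaves most of the capacity in the two encoders, which matches the caption's emphasis on a lightweight fusion step.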