NeRF-DetS: Enhanced Adaptive Spatial-wise Sampling and View-wise Fusion Strategies for NeRF-based Indoor Multi-view 3D Object Detection

Chi Huang, Xinyang Li, Yansong Qu, Changli Wu, Xiaofan Li, Shengchuan Zhang, Liujuan Cao

TL;DR

This work tackles indoor 3D object detection from multi-view RGB images using NeRF-based representations. It introduces two components: the Progressive Adaptive Sampling Strategy (PASS), which refines sampling points across detector layers using offsets predicted by the preceding layer, and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA), which fuses multi-view features in a depth-aware way at reduced computational cost, guided by geometry from the NeRF branch. Together, these components improve feature continuity in 3D space and make fusion occlusion-aware, improving on the NeRF-Det baseline on ScanNetV2 by +5.02% mAP at IoU 0.25 and +5.92% mAP at IoU 0.50, with consistent gains on ARKITScenes. The results suggest that combining adaptive spatial sampling with depth-guided fusion lets an end-to-end indoor 3D detector make better use of NeRF's implicit geometry.

Abstract

In indoor scenes, the diverse distribution of object locations and scales makes visual 3D perception challenging. Previous works (e.g., NeRF-Det) have demonstrated that implicit representations can benefit visual 3D perception in indoor scenes where the input images overlap heavily. However, these works cannot fully exploit implicit representations because they rely on fixed sampling and simple multi-view feature fusion. In this paper, inspired by sparse-fashion methods (e.g., DETR3D), we propose a simple yet effective method, NeRF-DetS, to address these issues. NeRF-DetS includes two modules: Progressive Adaptive Sampling Strategy (PASS) and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA). Specifically, (1) PASS automatically samples the features of each layer within a dense 3D detector, using offsets predicted by the previous layer; (2) DS-MHA efficiently fuses multi-view features with strong occlusion awareness while also reducing computational cost. Extensive experiments on the ScanNetV2 dataset demonstrate that NeRF-DetS outperforms NeRF-Det, achieving improvements of +5.02% and +5.92% in mAP under IoU25 and IoU50, respectively. NeRF-DetS also shows consistent improvements on ARKITScenes.
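
The PASS idea in the abstract (each layer samples features at positions shifted by offsets that the previous layer predicted) can be illustrated with a short loop in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `sample_fn`, `offset_heads`, and `fuse_fn` are hypothetical stand-ins for multi-view feature sampling, the learned per-layer offset predictors, and the fusion with the previous layer's features.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Feature = List[float]


def progressive_adaptive_sampling(
    points: Sequence[Point],
    sample_fn: Callable[[Point], Feature],                # hypothetical: feature at a 3D point
    offset_heads: Sequence[Callable[[Feature], Point]],   # hypothetical: one offset predictor per layer
    fuse_fn: Callable[[Feature, Feature], Feature],       # hypothetical: fuse new and previous features
) -> Tuple[List[Point], List[Feature]]:
    """Refine sampling points layer by layer (illustrative sketch only)."""
    pts = list(points)
    # First layer: sample at the initial, fixed positions.
    feats = [sample_fn(p) for p in pts]
    for head in offset_heads:
        # The previous layer's features predict a 3D offset for each point.
        offsets = [head(f) for f in feats]
        pts = [tuple(c + d for c, d in zip(p, o)) for p, o in zip(pts, offsets)]
        # Re-sample at the shifted positions, then fuse with the previous layer.
        new_feats = [sample_fn(p) for p in pts]
        feats = [fuse_fn(n, f) for n, f in zip(new_feats, feats)]
    return pts, feats
```

With an empty `offset_heads` list the function reduces to fixed sampling, which is the behaviour the abstract attributes to earlier methods; each additional head adds one refinement step.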

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison between NeRF-DetS and other methods. There are two main steps in 3D object detection using 2D images: view fusion and space sampling strategies. We compare our method with NeRF-Det and the Sparse Fashion method.
  • Figure 2: Overview of NeRF-DetS. The Progressive Adaptive Sampling Strategy consists of Progressive Adaptive Sampling and Layer Fusion: Progressive Adaptive Sampling updates the sampling points for the subsequent layer, and Layer Fusion fuses in the features from the previous layer. The blue part shows the process of Depth-Guided Simplified Multi-Head Attention Fusion. The dashed lines represent our supervision. Throughout the process, we use geometry information from the NeRF branch, use the offsets predicted by the previous layer to guide sampling, and use depth to guide the fusion of multi-view features.
  • Figure 3: Comparison of GPU memory consumption between Multi-Head Attention Fusion and our Depth-Guided Simplified Multi-Head Attention Fusion. Both are evaluated with 40 views and a feature dimension of 256.
  • Figure 4: Depth-Guided Simplified Multi-Head Attention Fusion. The figure shows the process for each point in the volume. We concatenate the feature with the depth to predict the multi-head features and weights.
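
The per-point fusion described in the Figure 4 caption can be sketched as follows. This is only an illustration under assumptions, not the paper's code: the two weight matrices (`w_value`, `w_weight`), the use of a softmax over views, and the choice of a single scalar depth cue per view are all hypothetical, since this section does not specify them. The sketch predicts the per-head weights directly from the concatenated input rather than from query-key products, which is one plausible reading of "simplified".

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output unit


def linear(w: Matrix, x: Sequence[float]) -> Vector:
    """Bias-free linear layer: returns w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(xs: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def depth_guided_fusion(
    view_feats: Sequence[Sequence[float]],  # one feature vector per view
    view_depths: Sequence[float],           # assumed: one scalar depth cue per view
    w_value: Matrix,                        # hypothetical: (num_heads * head_dim) x (feat_dim + 1)
    w_weight: Matrix,                       # hypothetical: num_heads x (feat_dim + 1)
    num_heads: int,
) -> Vector:
    """Fuse the multi-view features of a single 3D point (illustrative sketch)."""
    # Concatenate each view's feature with its depth cue.
    inputs = [list(f) + [d] for f, d in zip(view_feats, view_depths)]
    values = [linear(w_value, x) for x in inputs]   # per-view multi-head features
    logits = [linear(w_weight, x) for x in inputs]  # per-view, per-head weight logits
    head_dim = len(w_value) // num_heads
    fused: Vector = []
    for h in range(num_heads):
        # Normalize each head's weights across views.
        attn = softmax([l[h] for l in logits])
        for k in range(head_dim):
            fused.append(sum(a * v[h * head_dim + k] for a, v in zip(attn, values)))
    return fused
```

In this sketch the cost per point grows linearly with the number of views, because no view-to-view attention matrix is formed.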