NeRF-DetS: Enhanced Adaptive Spatial-wise Sampling and View-wise Fusion Strategies for NeRF-based Indoor Multi-view 3D Object Detection
Chi Huang, Xinyang Li, Yansong Qu, Changli Wu, Xiaofan Li, Shengchuan Zhang, Liujuan Cao
TL;DR
This work tackles indoor 3D object detection from multi-view RGB images using NeRF-based representations. It introduces two components: the Progressive Adaptive Sampling Strategy (PASS), which refines sampling points across detector layers using learned offsets, and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA), which fuses multi-view features in an occlusion-aware way at reduced computational cost, guided by geometry from the NeRF branch. Together, these components improve feature continuity in 3D space and make fusion more robust to occlusion, improving on the NeRF-Det baseline on ScanNetV2 (+5.02% mAP at IoU 0.25, +5.92% mAP at IoU 0.50) and showing consistent gains on ARKitScenes. The results suggest that combining adaptive spatial sampling with depth-guided fusion makes better use of NeRF's strengths for end-to-end 3D perception in indoor scenes.
Abstract
In indoor scenes, the diverse distribution of object locations and scales makes visual 3D perception challenging. Previous works (e.g., NeRF-Det) have demonstrated that implicit representations can benefit visual 3D perception in indoor scenes where the input images overlap heavily. However, these works cannot fully exploit the advantages of implicit representations because they rely on fixed sampling and simple multi-view feature fusion. In this paper, inspired by sparse methods (e.g., DETR3D), we propose a simple yet effective method, NeRF-DetS, to address these issues. NeRF-DetS includes two modules: Progressive Adaptive Sampling Strategy (PASS) and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA). Specifically, (1) PASS automatically samples features for each layer within a dense 3D detector, using offsets predicted by the previous layer; (2) DS-MHA not only fuses multi-view features efficiently with strong occlusion awareness but also reduces computational cost. Extensive experiments on the ScanNetV2 dataset demonstrate that NeRF-DetS outperforms NeRF-Det, achieving +5.02% and +5.92% improvements in mAP under IoU25 and IoU50, respectively. NeRF-DetS also shows consistent improvements on ARKitScenes.
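The two ideas can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; it only shows the general shape of layer-wise re-sampling with offsets, and of down-weighting views whose depth disagrees with the sampled point. All names (`project`, `lift_and_fuse`, `progressive_sample`, `offset_fns`, `tau`) are hypothetical. The sketch assumes per-view depth maps as a stand-in for geometry from the NeRF branch, a plain softmax over views in place of the simplified multi-head attention, nearest-neighbour feature lookup, and offsets that accumulate across layers.

```python
import numpy as np

def project(points, K, cam_from_world):
    """Project (N, 3) world points; return (N, 2) pixel coords and (N,) depth."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (cam_from_world @ homo.T).T[:, :3]
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    return uv, cam[:, 2]

def lift_and_fuse(points, feats, depth_maps, Ks, poses, tau=0.1):
    """Sample every view at `points` and fuse with depth-guided weights.

    feats: (V, H, W, C) image features; depth_maps: (V, H, W).
    Returns fused features of shape (N, C).
    """
    V, H, W, C = feats.shape
    n = len(points)
    sampled = np.zeros((V, n, C))
    logits = np.full((V, n), -1e9)
    valid = np.zeros((V, n), dtype=bool)
    for v in range(V):
        uv, depth = project(points, Ks[v], poses[v])
        px = np.clip(np.rint(uv), -1, max(H, W)).astype(int)
        ok = ((depth > 0) & (px[:, 0] >= 0) & (px[:, 0] < W)
              & (px[:, 1] >= 0) & (px[:, 1] < H))
        x = np.clip(px[:, 0], 0, W - 1)
        y = np.clip(px[:, 1], 0, H - 1)
        sampled[v] = feats[v, y, x]  # nearest-neighbour lookup
        # Assumed depth guide: a view scores higher when its depth at the
        # pixel agrees with the point's depth, i.e. the point is not occluded.
        logits[v] = np.where(ok, -np.abs(depth - depth_maps[v, y, x]) / tau, -1e9)
        valid[v] = ok
    # Softmax over views, restricted to views that actually see the point.
    w = np.exp(logits - logits.max(axis=0, keepdims=True)) * valid
    w = w / np.clip(w.sum(axis=0, keepdims=True), 1e-8, None)
    return (w[..., None] * sampled).sum(axis=0)

def progressive_sample(points, offset_fns, feats, depth_maps, Ks, poses):
    """Re-sample once per layer, shifting points by that layer's offsets."""
    for offset_fn in offset_fns:
        fused = lift_and_fuse(points, feats, depth_maps, Ks, poses)
        points = points + offset_fn(fused)  # (N, C) features -> (N, 3) offsets
    return points, lift_and_fuse(points, feats, depth_maps, Ks, poses)
```

Here each entry of `offset_fns` stands in for a detector layer's offset head and maps `(N, C)` fused features to `(N, 3)` displacements. Points that no view observes receive an all-zero fused feature.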
