CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images

Guanlin Shen; Jingwei Huang; Zhihua Hu; Bin Wang

TL;DR

CN-RMA tackles indoor 3D object detection from multi-view images by combining neural implicit scene reconstruction with occlusion-aware feature aggregation. Its core component, Ray Marching Aggregation (RMA), weights the image features projected along each ray using a volume density and transmittance derived from a rough TSDF, which mitigates occlusion-induced misprojections. The method couples an Atlas-inspired MVS module with an FCAF3D detector in an end-to-end trainable pipeline, trained with a three-stage scheme. Empirically, CN-RMA achieves state-of-the-art mAP@0.25 and mAP@0.5 on ScanNet and ARKitScenes, outperforming prior single-stage and two-stage approaches. The work demonstrates the value of geometry-informed, occlusion-aware aggregation for robust indoor 3D perception and points toward broader use of implicit representations in 3D detection.

Abstract

This paper introduces CN-RMA, a novel approach for 3D indoor object detection from multi-view images. We identify the key challenge as the ambiguity of the correspondence between image pixels and 3D locations when no explicit geometry is available to provide occlusion information. To address this issue, CN-RMA leverages the synergy of 3D reconstruction networks and 3D object detection networks: the reconstruction network provides a rough Truncated Signed Distance Function (TSDF) and guides image features to vote into 3D space correctly in an end-to-end manner. Specifically, we assign weights to the sampled points of each ray through ray marching, representing the contribution of a pixel in an image to the corresponding 3D locations. These weights are determined by the predicted signed distances, so that image features vote only to regions near the reconstructed surface. Our method achieves state-of-the-art performance in 3D object detection from multi-view images, as measured by mAP@0.25 and mAP@0.5 on the ScanNet and ARKitScenes datasets. The code and models are released at https://github.com/SerCharles/CN-RMA.
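
The weighting idea described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the `beta` sharpness parameter, and the logistic mapping from signed distance to density are assumptions chosen for illustration, and only the compositing of per-sample opacity with accumulated transmittance follows the standard ray-marching formulation that the text refers to.

```python
import math


def sdf_to_density(sdf, beta=0.05):
    """Hypothetical SDF-to-density map (an assumption, not the paper's formula).

    Uses the derivative of a sigmoid, which peaks at the zero crossing of the
    signed distance and decays on both sides of the surface.
    """
    x = max(-60.0, min(60.0, sdf / beta))  # clamp to keep exp() in range
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s) / beta


def ray_marching_weights(sdf_samples, step, beta=0.05):
    """Return one weight per sample along a single ray.

    weight_i = T_i * alpha_i, where alpha_i = 1 - exp(-sigma_i * step) is the
    opacity of sample i and T_i is the transmittance accumulated before it.
    """
    weights = []
    transmittance = 1.0
    for sdf in sdf_samples:
        sigma = sdf_to_density(sdf, beta)
        alpha = 1.0 - math.exp(-sigma * step)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha  # light remaining for later samples
    return weights


# Toy ray with two zero crossings: a visible surface at index 2 and a
# surface behind it at index 6.
toy_sdf = [0.4, 0.2, 0.0, -0.2, -0.4, -0.2, 0.0]
toy_weights = ray_marching_weights(toy_sdf, step=0.1)
```

In this toy ray the largest weight falls on the first zero crossing, and the second crossing receives a smaller weight because part of the transmittance has already been absorbed. This is the occlusion-aware behaviour the abstract describes: a pixel's feature votes mainly for the nearest reconstructed surface along its ray.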

Paper Structure (23 sections, 6 equations, 7 figures, 9 tables)

Introduction
Related Work
3D Object Detection from Multi-View Images
Neural Implicit Reconstruction
3D Object Detection from Point Clouds
Method
Problem Formulation
Multi-View Stereo Module
Ray Marching Aggregation
3D Object Detection Network
Training Procedure
Experiments
Datasets, Metrics, and Baselines
Implementation Details
Comparison
...and 8 more sections

Figures (7)

Figure 1: Comparison of our CN-RMA, the two-stage method, and ImVoxelNet [rukhovich2022imvoxelnet]. CN-RMA is an end-to-end object detection method that incorporates an occlusion-aware 2D-to-3D aggregation technique. In contrast, the two-stage method is not end-to-end trainable, while ImVoxelNet employs a heuristic aggregation method that disregards occlusion.
Figure 2: The overall architecture of our CN-RMA method. The purple blocks represent neural networks, while the green blocks represent modules without trainable parameters. Following Atlas [murez2020atlas], the 2D CNN backbone $\mathcal{F}$ is a ResNet50-FPN network [lin2017feature], and the 3D reconstruction network $\mathcal{R}$ is a 3D CNN with an encoder-decoder structure, skip connections, and a $1\times1\times1$ convolutional head. Following FCAF3D [rukhovich2022fcaf3d], the object detection network $\mathcal{D}$ is a sparse 3D convolutional network comprising a ResNet34 backbone [choy20194d] and a 4-layer decoder network.
Figure 3: A 1D illustration comparing our Ray Marching Aggregation (RMA) method with the Depth Aggregation (DA) method based on depth prediction and the Volume Aggregation (VA) method based on unprojection [rukhovich2022imvoxelnet, sun2021neuralrecon, murez2020atlas]. The points in the illustration represent sample points along a ray, with their colors indicating their respective weights. Points enclosed within a red square are the selected points.
Figure 4: Visualization of 3D object detection results on ScanNet [dai2017scannet]. From top to bottom: scene0559_01, scene0598_00, and scene0701_00. Atlas+FCAF denotes the two-stage baseline combining Atlas [murez2020atlas] and FCAF3D [rukhovich2022fcaf3d], and Neucon+FCAF denotes the two-stage baseline combining NeuralRecon [sun2021neuralrecon] and FCAF3D.
Figure 5: Visualization of 3D object detection results on ARKitScenes [dehghan2021arkitscenes]. From top to bottom: scenes 44358583, 45663154, and 45261181. Atlas+FCAF denotes the two-stage baseline combining Atlas [murez2020atlas] and FCAF3D [rukhovich2022fcaf3d], and Neucon+FCAF denotes the two-stage baseline combining NeuralRecon [sun2021neuralrecon] and FCAF3D.
...and 2 more figures
