VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis
Sen Wang, Qing Cheng, Stefano Gasperini, Wei Zhang, Shun-Cheng Wu, Niclas Zeller, Daniel Cremers, Nassir Navab
TL;DR
VoxNeRF addresses sparse-view novel view synthesis in indoor scenes by integrating geometry priors into NeRF through a Sparse Voxel Octree (SVO) and a voxel-guided sampling strategy. The method models geometry uncertainty with a Gaussian centered on each ray-surface intersection, densifies sampling near surfaces, and optimizes a multi-resolution hash grid radiance field with a photometric loss $L_c$, a robust depth loss $L_d$, and a depth-gradient regularizer $L_{reg}$, combined as $L = L_c + \lambda_d (L_d + L_{reg})$. On ScanNet and ScanNet++, VoxNeRF reports state-of-the-art fidelity and significantly faster training than prior methods, especially in extrapolation scenarios where view overlap is limited. The geometry-guided approach improves robustness to occlusions and textureless regions, which is useful for indoor robotics, though it relies on preprocessing steps to obtain the priors. Overall, VoxNeRF shows that structured scene priors can improve both the quality and the efficiency of indoor neural rendering, a step toward real-time robotic perception and interaction.
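As a rough illustration of the voxel-guided sampling idea, the Python sketch below draws dense samples from a Gaussian centered on the depth where a ray meets the voxel prior, then adds a few stratified samples over the rest of the ray. The function name, the parameters (`t_hit`, `sigma`, the sample counts), and the stratified fallback are assumptions made for illustration, not details confirmed by the paper. The SVO lookup that would supply `t_hit` is omitted.

```python
import random


def voxel_guided_samples(t_hit, sigma, n_near, n_coarse, t_near, t_far, seed=0):
    """Return sorted sample depths along a single ray.

    t_hit  : depth of the ray-surface intersection from the voxel prior
             (hypothetical input; a real system would query the octree)
    sigma  : assumed standard deviation modelling uncertainty of the prior
    n_near : number of dense samples concentrated around t_hit
    n_coarse : number of stratified samples over [t_near, t_far)
    """
    rng = random.Random(seed)

    # Dense samples: Gaussian around the intersection, clamped to the ray bounds.
    near = [
        min(max(rng.gauss(t_hit, sigma), t_near), t_far)
        for _ in range(n_near)
    ]

    # Sparse stratified samples: one jittered depth per equal-width bin
    # (an assumed fallback so regions away from the prior are still covered).
    step = (t_far - t_near) / n_coarse
    coarse = [t_near + (i + rng.random()) * step for i in range(n_coarse)]

    return sorted(near + coarse)
```

Calling `voxel_guided_samples(2.0, 0.05, 32, 8, 0.1, 6.0)` returns 40 depths, most of them clustered around the prior surface at depth 2.0. A smaller `sigma` expresses more trust in the geometry prior and concentrates the sampling budget more tightly.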
Abstract
High-fidelity view synthesis is essential for robotic navigation and interaction but remains challenging, particularly in indoor environments and real-time scenarios. Existing techniques often require significant computational resources for both training and rendering, and they frequently produce suboptimal 3D representations due to insufficient geometric structuring. To address these limitations, we introduce VoxNeRF, a novel approach that uses easy-to-obtain geometry priors to enhance both the quality and efficiency of neural indoor reconstruction and novel view synthesis. We propose an efficient voxel-guided sampling technique that allocates computational resources selectively to the most relevant segments of rays based on a voxel-encoded geometry prior, significantly reducing training and rendering time. Additionally, we incorporate a robust depth loss to improve reconstruction and rendering quality in sparse-view settings. We validate our approach with extensive experiments on ScanNet and ScanNet++, where VoxNeRF outperforms existing state-of-the-art methods and establishes a new benchmark for indoor immersive interpolation and extrapolation settings.
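The objective quoted in the TL;DR, $L = L_c + \lambda_d (L_d + L_{reg})$, might be assembled as in the sketch below. The source does not specify the form of the robust depth loss or of the depth-gradient regularizer. The Huber penalty, the finite-difference gradient between neighboring rays, and the default values of `lambda_d` and `delta` are therefore all assumptions chosen for illustration.

```python
def huber(residual, delta):
    """Huber penalty: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def total_loss(pred_rgb, gt_rgb, pred_depth, prior_depth,
               lambda_d=0.1, delta=0.1):
    """Combine color, depth, and depth-gradient terms for a batch of rays.

    pred_rgb, gt_rgb        : lists of per-ray color tuples
    pred_depth, prior_depth : lists of per-ray depths; neighboring entries
                              are assumed to come from neighboring pixels
    """
    n = len(pred_depth)

    # L_c: mean squared color error over all rays and channels.
    sq = [(p - g) ** 2
          for pr, gr in zip(pred_rgb, gt_rgb)
          for p, g in zip(pr, gr)]
    l_c = sum(sq) / len(sq)

    # L_d: robust depth term (Huber is an assumption, not the paper's choice).
    l_d = sum(huber(p - q, delta)
              for p, q in zip(pred_depth, prior_depth)) / n

    # L_reg: mismatch between finite-difference depth gradients of the
    # rendered depth and the prior depth (assumed form).
    grad_err = [
        abs((pred_depth[i + 1] - pred_depth[i])
            - (prior_depth[i + 1] - prior_depth[i]))
        for i in range(n - 1)
    ]
    l_reg = sum(grad_err) / len(grad_err) if grad_err else 0.0

    return l_c + lambda_d * (l_d + l_reg)
```

Because a single weight $\lambda_d$ scales both geometry terms, lowering it relaxes the depth supervision and the gradient regularization together, which may be useful when the prior is noisy.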
