One at a Time: Progressive Multi-step Volumetric Probability Learning for Reliable 3D Scene Perception
Bohan Li, Yasheng Sun, Jingxin Dong, Zheng Zhu, Jinming Liu, Xin Jin, Wenjun Zeng
TL;DR
The paper addresses the challenge of obtaining reliable 3D volumetric representations for scene perception tasks such as multi-view stereo (MVS) and semantic scene completion (SSC), where single-step methods struggle under occlusions and complex lighting. It proposes Volumetric Probability Diffusion (VPD), a multi-step generative framework that progressively refines volumetric probabilities through a diffusion process conditioned on coarse priors and multi-scale contextual features, aided by the Confidence-Aware Contextual Collaboration (CACC) and Online Filtering (OF) strategies. The approach reports state-of-the-art results on standard MVS benchmarks and notable gains in SSC, including surpassing LiDAR-based methods on SemanticKITTI using only camera inputs. The work highlights the potential of diffusion-based, multi-step distribution modeling to produce more accurate and reliable 3D scene representations, with practical relevance to robotics and autonomous systems.
Abstract
Numerous studies have investigated the pivotal role of reliable 3D volume representation in scene perception tasks such as multi-view stereo (MVS) and semantic scene completion (SSC). They typically construct 3D probability volumes directly from geometric correspondence, attempting to fully address the scene perception task in a single forward pass. However, such a single-step solution makes it hard to learn accurate and convincing volumetric probabilities, especially in challenging regions such as unexpected occlusions and complicated light reflections. This paper therefore proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps to facilitate fine and reliable scene perception. Building on the recent advances of strong generative diffusion models, we introduce a multi-step learning framework, dubbed VPD, dedicated to progressively refining the Volumetric Probability in a Diffusion process. Extensive experiments on both MVS and SSC validate the efficacy of our method in learning reliable volumetric representations. Notably, for the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset.
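
To make the contrast between single-step and multi-step estimation concrete, the following is a minimal toy sketch, not the VPD implementation. It assumes a probability volume of shape (depth hypotheses, height, width), starts from Gaussian noise, and over a few steps pulls the logits toward a coarse prior before normalizing along the depth axis. The names `refine_volume` and `blend_toward_prior`, the fixed blend rate, and the step count are all hypothetical. The learned denoising network, CACC, and OF are not modeled here.

```python
import numpy as np


def softmax(x, axis=0):
    # Numerically stable softmax along the chosen axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def blend_toward_prior(logits, coarse_prior, rate=0.5):
    # Hypothetical stand-in for a learned denoiser (assumption, not from
    # the paper): moves the current logits part of the way toward the
    # log of the coarse prior.
    return (1.0 - rate) * logits + rate * np.log(coarse_prior + 1e-8)


def refine_volume(coarse_prior, num_steps=4, seed=0):
    # Start from pure noise and refine over several steps, conditioned
    # on the coarse prior at every step.
    rng = np.random.default_rng(seed)
    logits = rng.standard_normal(coarse_prior.shape)
    for _ in range(num_steps):
        logits = blend_toward_prior(logits, coarse_prior)
    # Normalize over the depth axis so each pixel holds a distribution.
    return softmax(logits, axis=0)


# Toy coarse prior: 8 depth hypotheses on a 4x4 image, peaked at index 3.
prior = np.full((8, 4, 4), 0.05)
prior[3] = 0.65
refined = refine_volume(prior)
```

In this toy setting the influence of the initial noise halves at every step, so `refined` is a valid per-pixel distribution over depth whose peak agrees with the prior. A real diffusion model would replace `blend_toward_prior` with a trained network and a proper noise schedule.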
