PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation
Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang Yu, Tao Chen
TL;DR
Large-scale outdoor scenes challenge implicit neural representations because the sampling space grows cubically with scene extent, which hinders high-fidelity texture synthesis. The authors introduce PDF, a two-stage framework that first learns a dense surface prior via a diffusion-based point cloud super-resolution model, then renders the unbounded background with region-based sampling built on Mip-NeRF 360, and finally fuses foreground and background into complete novel views. The diffusion process is trained with forward and reverse transitions $q(\hat{x}_{t}|\hat{x}_{t-1},z_{0})$ and $p_{\theta}(\hat{x}_{t-1}|\hat{x}_{t},z_{0})$, optimized by $\mathcal{L}_{D}$, while foreground rendering uses $(\sigma,r) = \text{Point-NeRF}(c,d,f)$ and the background model is trained with $\mathcal{L}_{R}$. Experiments on the OMMO and BlendedMVS datasets show substantial quantitative gains (higher PSNR and SSIM, lower LPIPS) and qualitative improvements in foreground detail and background consistency, supporting the effectiveness and robustness of diffusion-based surface priors for large-scale scene neural representation.
Abstract
Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along rays cast through the sampling space. However, because the sampling space grows explosively with scene size, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the sampling space. We propose a Point Diffusion Implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud, which serves as an explicit prior. In the rendering stage, only sampling points that have prior points within the sampling radius are retained; that is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the scene background that the point cloud cannot provide, region sampling based on Mip-NeRF 360 is employed to model the background representation. Extensive experiments demonstrate the effectiveness of our method for large-scale scene novel view synthesis, where it outperforms relevant state-of-the-art baselines.
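The sample-retention rule described in the abstract (keep a ray sample only if some prior point lies within the sampling radius) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_along_ray` and `filter_samples_by_prior`, the uniform ray sampling, the brute-force distance search, and the toy values are all assumptions chosen for illustration. A practical system would likely use a spatial index over the dense point cloud rather than a linear scan.

```python
import math

def sample_along_ray(origin, direction, near, far, n):
    """Return n uniformly spaced 3D samples along a ray (hypothetical helper)."""
    step = (far - near) / (n - 1)
    return [
        tuple(o + (near + i * step) * d for o, d in zip(origin, direction))
        for i in range(n)
    ]

def filter_samples_by_prior(samples, prior_points, radius):
    """Keep only samples with at least one prior point within `radius`.

    Brute-force search for clarity; this is an illustrative assumption,
    not the method used in the paper.
    """
    return [
        s for s in samples
        if any(math.dist(s, p) <= radius for p in prior_points)
    ]

# Toy "dense point cloud" prior: two points near depth 5 on the z-axis.
prior_points = [(0.0, 0.0, 5.0), (0.1, 0.0, 5.1)]

# Ten samples at depths 1, 2, ..., 10 along the +z direction.
samples = sample_along_ray((0, 0, 0), (0, 0, 1), near=1.0, far=10.0, n=10)

# Only the sample that lies close to the surface prior survives.
kept = filter_samples_by_prior(samples, prior_points, radius=0.5)
print(len(samples), "->", len(kept), "samples retained:", kept)
```

In this toy setup, samples in empty space far from the prior points are discarded before any network evaluation, which is the sense in which the sampling space shrinks from the unbounded volume to the neighborhood of the scene surface.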
