PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation

Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen

TL;DR

Large-scale outdoor scenes challenge implicit neural representations because the sampling space grows cubically with scene extent, which hinders high-fidelity texture synthesis. The authors introduce PDF, a two-stage framework. A diffusion-based point cloud super-resolution model first learns a dense surface prior, which restricts foreground sampling to the scene surface. The unbounded background is then rendered with region-based sampling following Mip-NeRF 360, and the foreground and background are fused to produce complete novel views. The diffusion process is trained with forward and reverse transitions $q(\hat{x}_{t}|\hat{x}_{t-1},z_{0})$ and ${p}_{\theta}(\hat{x}_{t-1}|\hat{x}_{t},z_{0})$, optimized by $\mathcal{L}_{D}$, while rendering uses $(\sigma,r) = \text{Point-NeRF}(c,d,f)$ and a background model with $\mathcal{L}_{R}$. Experiments on the OMMO and BlendedMVS datasets show substantial quantitative gains (higher PSNR and SSIM, lower LPIPS) and qualitative improvements in foreground detail and background consistency, supporting the effectiveness and robustness of diffusion-based surface priors for large-scale scene neural representation.

Abstract

Recent advances in implicit neural representations have achieved impressive results by sampling and fusing individual points along sampling rays in the sampling space. However, due to the explosively growing sampling space, finely representing and synthesizing detailed textures remains a challenge for unbounded large-scale outdoor scenes. To alleviate the dilemma of using individual points to perceive the entire colossal space, we explore learning the surface distribution of the scene to provide structural priors and reduce the samplable space, and we propose a Point Diffusion implicit Function, PDF, for large-scale scene neural representation. The core of our method is a large-scale point cloud super-resolution diffusion module that enhances the sparse point cloud reconstructed from several training images into a dense point cloud as an explicit prior. Then, in the rendering stage, only sampling points with prior points within the sampling radius are retained. That is, the sampling space is reduced from the unbounded space to the scene surface. Meanwhile, to fill in the background of the scene that cannot be provided by point clouds, region sampling based on Mip-NeRF 360 is employed to model the background representation. Extensive experiments have demonstrated the effectiveness of our method for large-scale scene novel view synthesis, which outperforms relevant state-of-the-art baselines.
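
The sample-retention rule in the abstract (keep a ray sample only if a prior point lies within the sampling radius) can be illustrated with a small sketch. This is not the authors' implementation: the function name `filter_ray_samples`, the toy point cloud, and the brute-force neighbor search are all assumptions made for illustration. A practical system would presumably use a spatial index such as a voxel grid or KD-tree instead of the linear scan.

```python
import math


def filter_ray_samples(origin, direction, t_values, prior_points, radius):
    """Keep only the ray samples that have at least one prior point within `radius`.

    Hypothetical helper: brute-force search, for illustration only.
    """
    kept = []
    for t in t_values:
        # Position of the sample along the ray: origin + t * direction.
        sample = tuple(o + t * d for o, d in zip(origin, direction))
        # Retain the sample if any prior (surface) point is close enough.
        if any(math.dist(sample, p) <= radius for p in prior_points):
            kept.append((t, sample))
    return kept


# Toy dense point cloud standing in for the diffusion-upsampled surface prior.
prior_points = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.1)]

origin = (0.0, 0.0, 0.0)
direction = (0.0, 0.0, 1.0)
t_values = [0.5 * i for i in range(1, 11)]  # uniform samples at t = 0.5 ... 5.0

kept = filter_ray_samples(origin, direction, t_values, prior_points, radius=0.3)
print(f"{len(kept)} of {len(t_values)} samples retained")
```

In this toy setup most samples along the ray fall in empty space and are discarded, which is the sense in which the sampling space shrinks from the whole volume to a neighborhood of the surface.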

Paper Structure (17 sections, 9 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: The pipeline of our point diffusion implicit function. Our method consists of two modules, a point diffusion rendering module and a background rendering module. The former learns the surface distribution of the scene through a diffusion-based point cloud super-resolution model and renders foreground features from the dense point cloud surface. The latter follows Mip-NeRF 360's strategy to render background features. Finally, the foreground and background features are fused to generate photo-realistic novel views for large-scale outdoor scenes.
  • Figure 2: Our point upsampling diffusion. In the forward process, Gaussian noise is gradually added to the sparse point cloud. In the reverse process, the noise is gradually removed to obtain a dense point cloud surface.
  • Figure 3: Qualitative comparison of our method with the baselines on the OMMO dataset. Our PDF method outperforms the baselines and reconstructs details reliably. For Mip-NeRF and Mega-NeRF, which also target large scenes, yellow dashed boxes mark regions where differences in detail are easy to see. Please zoom in for the best view.
  • Figure 4: A failure case of scene representation with Mip-NeRF 360.
  • Figure 5: Qualitative results of the ablation experiments. From left to right: removing both the diffusion-based point cloud upsampling module and the background fusion module, removing only the background fusion module, removing only the diffusion-based point cloud upsampling module, our full PDF method, and the ground truth.
  • ...and 1 more figures
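
The forward process described in the Figure 2 caption, where Gaussian noise is gradually added to a point cloud, can be sketched with the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. This is a generic, unconditional illustration and not the paper's exact formulation: it does not model the conditioning on $z_{0}$ that appears in the transitions above, and the linear noise schedule, step count, and function names are assumptions.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running_product = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running_product *= 1.0 - beta
        alpha_bars.append(running_product)
    return alpha_bars


def noise_points(points, alpha_bar_t, rng):
    """Sample x_t from x_0 in one shot: sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [
        tuple(signal_scale * c + noise_scale * rng.gauss(0.0, 1.0) for c in p)
        for p in points
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = make_alpha_bars(1000)

points = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.1)]  # toy point cloud
slightly_noisy = noise_points(points, alpha_bars[0], rng)   # early step: nearly unchanged
mostly_noise = noise_points(points, alpha_bars[-1], rng)    # final step: close to pure noise
```

The reverse process would train a network to undo these steps; that part depends on the paper's architecture and loss $\mathcal{L}_{D}$ and is not sketched here.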