Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images

Xudong Cai, Yongcai Wang, Zhaoxin Fan, Deng Haoran, Shuo Wang, Wanting Li, Deying Li, Lun Luo, Minhang Wang, Jintao Xu

TL;DR

Dust to Tower (D2T) presents a coarse-to-fine framework for photo-realistic scene reconstruction from sparse, uncalibrated images by jointly optimizing a 3D Gaussian Splatting model and camera poses. It introduces CCM for fast coarse modeling, CADA for depth alignment to enable accurate warping, and WIGI to warp and inpaint images at novel viewpoints, providing high-quality supervision for refinement. The approach achieves state-of-the-art results in novel view synthesis and pose estimation across multiple datasets with efficient training times, demonstrating strong practical applicability. The combination of fast initialization, depth-aware warping, and inpainting-based supervision yields robust reconstruction under sparse inputs, with potential for scalable, real-world deployment.

Abstract

Photo-realistic scene reconstruction from sparse-view, uncalibrated images is in high demand in practice. Although some progress has been made, existing methods are either sparse-view but require accurate camera parameters (i.e., intrinsics and extrinsics), or SfM-free but need densely captured images. To combine the advantages of both families while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework that optimizes 3D Gaussian Splatting (3DGS) and image poses simultaneously from sparse and uncalibrated images. Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints. To do this, we first introduce a Coarse Construction Module (CCM), which exploits a fast Multi-View Stereo model to initialize a 3DGS model and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module that refines the coarse depth maps by aligning their confident parts with depths estimated by a mono-depth model. Then, a Warped Image-Guided Inpainting (WIGI) module warps the training images to novel viewpoints using the refined depth maps and applies inpainting to fill the "holes" in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies demonstrate the validity of D2T and its design choices, achieving state-of-the-art performance in both novel view synthesis and pose estimation while maintaining high efficiency. Code will be publicly available.
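To make the warping step concrete, the following is a minimal sketch of depth-based forward warping, the general technique that a module like WIGI builds on. It is not the authors' implementation: the function name `warp_to_novel_view`, the pinhole parameters `fx, fy, cx, cy`, the relative pose `R, t`, and the nearest-pixel splatting with a z-buffer are all assumptions chosen for illustration, and the code uses plain Python lists to stay dependency-free.

```python
def warp_to_novel_view(image, depth, fx, fy, cx, cy, R, t):
    """Forward-warp `image` into a novel view using per-pixel depth.

    image, depth: 2D lists (rows of pixels / depths in the source view).
    R (3x3 nested list), t (length-3 list): hypothetical pose mapping
    source-camera coordinates to target-camera coordinates, p' = R p + t.
    Returns (warped, hole_mask); hole_mask is True where no pixel landed.
    """
    h, w = len(image), len(image[0])
    warped = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            if z <= 0:
                continue  # skip invalid depth
            # Back-project pixel (u, v) to a 3D point in the source frame.
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            # Transform the point into the target camera frame.
            xt = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
            yt = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
            zt = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
            if zt <= 0:
                continue  # point is behind the target camera
            # Project into the target image (nearest pixel).
            ut = round(fx * xt / zt + cx)
            vt = round(fy * yt / zt + cy)
            # Keep the closest surface when several pixels collide.
            if 0 <= ut < w and 0 <= vt < h and zt < zbuf[vt][ut]:
                zbuf[vt][ut] = zt
                warped[vt][ut] = image[v][u]
    hole_mask = [[px is None for px in row] for row in warped]
    return warped, hole_mask
```

The returned `hole_mask` marks the disoccluded regions, which is where an inpainting model would be asked to fill in content. Errors in `depth` move pixels to the wrong target location, which is why accurate depth matters for the warped image to be usable as supervision.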

Paper Structure (33 sections, 9 equations, 8 figures, 5 tables)

This paper contains 33 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Relationship Between Training Time and PSNR. We show the training time and PSNR on the Tanks and Temples dataset with three input views. Our method is Pareto-optimal on the efficiency-accuracy trade-off when compared to existing baselines.
  • Figure 2: Overview of D2T. Given sparse-view and uncalibrated images, the Coarse Construction Module (CCM) first employs an efficient MVS method, DUSt3R, to construct a coarse point cloud and rough camera poses to initialize a 3DGS. The initial 3DGS and poses are optimized simultaneously using the input images for a few steps (\ref{sec:coarseSolution}). To refine the model at novel viewpoints, a Confidence Aware Depth Alignment (CADA) module is proposed to enhance the warping accuracy by aligning relative inverse depth from a state-of-the-art mono-depth model (\ref{sec:CADA}). Then, we propose a Warped Image-Guided Inpainting (WIGI) module to warp input images to unseen viewpoints and inpaint the missing parts of the warped images with a lightweight inpainting model (\ref{sec:warping}). Finally, the 3DGS and poses are further refined using the inpainted images at novel viewpoints.
  • Figure 3: Overview of the Confidence Aware Depth Alignment. We align the mono-depth $\mathbf{D}_c^m$ to the reliable part of the coarse depth $\mathbf{D}_c^{up}$, resulting in a high-quality depth map $\mathbf{D}_c^h$.
  • Figure 4: Visualization of the Mask Clean. Without Mask Clean, outliers in the warped image and warped mask can mislead the inpainting model, resulting in implausible results. The quality of the inpainted images is improved after applying Mask Clean.
  • Figure 5: Visualization of the inpainting results. "Tanks" denotes the Tanks and Temples dataset.
  • ...and 3 more figures
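
The alignment described in the Figure 3 caption can be illustrated with a small sketch. The caption does not state the exact form of the alignment, so this example assumes a common choice: a single global scale and shift, fitted by least squares only on pixels whose confidence exceeds a threshold. The function name `align_mono_depth`, the `conf_thresh` parameter, and the flattened-list inputs are hypothetical and are not taken from the paper.

```python
def align_mono_depth(mono, coarse, conf, conf_thresh=0.5):
    """Fit scale s and shift b so that s * mono + b matches coarse.

    mono, coarse, conf: flat lists of equal length (one entry per pixel).
    Only pixels with conf >= conf_thresh (an assumed rule) enter the fit,
    but the fitted transform is applied to every pixel of `mono`.
    Returns (aligned, scale, shift).
    """
    pairs = [(m, c) for m, c, w in zip(mono, coarse, conf) if w >= conf_thresh]
    if len(pairs) < 2:
        raise ValueError("need at least two confident pixels")
    n = len(pairs)
    mean_m = sum(m for m, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var_m == 0.0:
        raise ValueError("mono-depth is constant on the confident pixels")
    # Closed-form solution of the 1D least-squares line fit.
    cov = sum((m - mean_m) * (c - mean_c) for m, c in pairs)
    scale = cov / var_m
    shift = mean_c - scale * mean_m
    aligned = [scale * m + shift for m in mono]
    return aligned, scale, shift
```

In this sketch, an unreliable coarse depth value with low confidence does not influence the fit, and the corresponding pixel in the output takes its value from the rescaled mono-depth instead. Since the Figure 2 caption mentions relative inverse depth, the same fit could be applied to inverse-depth values rather than depths; that choice is left open here.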