LucidFusion: Reconstructing 3D Gaussians with Arbitrary Unposed Images

Hao He, Yixun Liang, Luozhou Wang, Yuanhao Cai, Xinli Xu, Hao-Xiang Guo, Xiang Wen, Yingcong Chen

TL;DR

LucidFusion removes the dependence on known camera poses in 3D reconstruction by introducing Relative Coordinate Maps (RCM) and extending them to Relative Coordinate Gaussians (RCG) for global 3D consistency. The method regresses RCMs from unposed images and then supervises 3D geometry through differentiable rendering of RCGs, which enables pose recovery via PnP and robust multi-view fusion. A two-stage training regime stabilizes learning and improves reconstruction quality and pose accuracy under sparse views over pose-free and feed-forward baselines, while supporting an arbitrary number of input views. The resulting pipeline runs in seconds, integrates with single-image-to-3D workflows, and reduces the need for explicit pose estimation in real-world applications.

Abstract

Recent large reconstruction models have made notable progress in generating high-quality 3D objects from single images. However, current reconstruction methods often rely on explicit camera pose estimation or fixed viewpoints, restricting their flexibility and practical applicability. We reformulate 3D reconstruction as image-to-image translation and introduce the Relative Coordinate Map (RCM), which aligns multiple unposed images to a main view without pose estimation. While RCM simplifies the process, its lack of global 3D supervision can yield noisy outputs. To address this, we propose Relative Coordinate Gaussians (RCG) as an extension to RCM, which treats each pixel's coordinates as a Gaussian center and employs differentiable rasterization for consistent geometry and pose recovery. Our LucidFusion framework handles an arbitrary number of unposed inputs, producing robust 3D reconstructions within seconds and paving the way for more flexible, pose-free 3D pipelines.
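To make the idea of a Relative Coordinate Map concrete, the sketch below builds one for a toy view: every pixel stores its 3D point expressed in the main view's camera frame. It assumes a pinhole camera and that a depth map and camera-to-world poses are available (for example, when preparing training targets); the abstract does not specify this procedure, and LucidFusion itself predicts the RCM with a network rather than computing it from poses. All function names here are hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = List[List[float]]
Pose = Tuple[Mat3, Vec3]  # assumed camera-to-world convention: x_world = R @ x_cam + t


def unproject(u: int, v: int, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with the given depth into its own camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def cam_to_world(p: Vec3, pose: Pose) -> Vec3:
    """Apply x_world = R @ p + t."""
    R, t = pose
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def world_to_cam(p: Vec3, pose: Pose) -> Vec3:
    """Invert a camera-to-world pose: x_cam = R^T @ (p - t)."""
    R, t = pose
    d = [p[i] - t[i] for i in range(3)]
    return tuple(sum(R[j][i] * d[j] for j in range(3)) for i in range(3))


def relative_coordinate_map(depth: List[List[float]],
                            intrinsics: Tuple[float, float, float, float],
                            pose: Pose,
                            main_pose: Pose) -> List[List[Vec3]]:
    """Per-pixel 3D coordinates of one view, expressed in the main view's frame."""
    fx, fy, cx, cy = intrinsics
    rcm = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            p_cam = unproject(u, v, d, fx, fy, cx, cy)
            p_world = cam_to_world(p_cam, pose)
            out_row.append(world_to_cam(p_world, main_pose))
        rcm.append(out_row)
    return rcm
```

For the main view itself the two transforms cancel, so its map is simply its own back-projected points. Every other view lands in that same frame, which is what lets the maps be fused without a separate alignment step.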

Paper Structure

This paper contains 24 sections, 10 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: LucidFusion uses the Relative Coordinate Gaussian (RCG) representation to achieve 3D reconstruction with pose estimation from an arbitrary number of sparse, unposed input views in a feed-forward manner.
  • Figure 2: Pilot study. We compare CCM and RCM given a set of input images. CCM fails to maintain consistency across different input views, as shown in the red box, while RCM successfully maintains the 2D-3D relation, as shown in the blue box.
  • Figure 3: Pipeline Overview of LucidFusion. Our framework processes a set of sparse, unposed multi-view images as input. The model predicts the RCM representation for the input images. Additionally, the feature map from the final layer of the encoder is fed into a decoder network to extend the RCM representation to RCG. The RCG is then rendered at novel views and supervised with ground truth images.
  • Figure 4: Visualization of stage 1 and stage 2 results.
  • Figure 5: Comparison of single-stage and two-stage training. With single-stage training, the model struggles to predict Gaussian locations.
  • ...and 5 more figures