The More You See in 2D, the More You Perceive in 3D

Xinyang Han; Zelin Gao; Angjoo Kanazawa; Shubham Goel; Yossi Gandelsman

TL;DR

SAP3D performs 3D reconstruction and novel view synthesis from an arbitrary number of unposed images by adapting a view-conditioned diffusion model at test time and refining the camera poses, which yields instance-specific priors. The method starts from coarse poses and a pre-trained diffusion model, then jointly optimizes the diffusion network and the camera parameters; the distilled 3D priors drive 3D reconstruction via NeRF-like representations and multi-view consistency. Experiments on synthetic (GSO) and real data show that more input views yield higher-fidelity geometry and textures as well as improved novel view synthesis, and ablations confirm the importance of test-time adaptation and 3D-prior preservation. The work bridges single-view diffusion-based 3D methods and traditional multi-view optimization with a flexible, learnable, prior-driven framework that benefits from better pre-trained models and larger datasets. The approach currently relies on offline optimization and a constrained pose parameterization, which leaves end-to-end, real-time, and more expressive parametric control as avenues for future work.

Abstract

Humans can infer 3D structure from 2D images of an object based on past experience and improve their 3D understanding as they see more images. Inspired by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel view synthesis from an arbitrary number of unposed images. Given a few unposed images of an object, we adapt a pre-trained view-conditioned diffusion model together with the camera poses of the images via test-time fine-tuning. The adapted diffusion model and the obtained camera poses are then utilized as instance-specific priors for 3D reconstruction and novel view synthesis. We show that as the number of input images increases, the performance of our approach improves, bridging the gap between optimization-based prior-less 3D reconstruction methods and single-image-to-3D diffusion-based methods. We demonstrate our system on real images as well as standard synthetic benchmarks. Our ablation studies confirm that this adaptation behavior is key for more accurate 3D understanding.

Paper Structure (47 sections, 11 equations, 9 figures, 6 tables)

Introduction
Related Work
Instance-specific 3D Reconstruction.
Single-view 3D Reconstruction.
Few-view 3D Reconstruction.
Test-Time Adaptation.
SAP3D
Initialization
Initial camera poses.
Initial view-conditioned 2D diffusion model.
Test-time optimization
Finetuning the diffusion model.
Optimizing camera poses.
3D prior preservation loss.
Novel View Synthesis
...and 32 more sections

Figures (9)

Figure 1: 3D from one or more unposed views. Our system reconstructs the 3D shape and texture of an object with a variable number of real input images. The first, second, and third rows show reconstructions from 1, 3, and 5 input images. The quality of 3D shape and texture improves with more views.
Figure 2: Overview of SAP3D. We first compute coarse relative camera poses using an off-the-shelf model. We fine-tune a view-conditioned 2D diffusion model on the input images and simultaneously refine the camera poses via optimization. The resulting instance-specific diffusion model and camera poses enable 3D reconstruction and novel view synthesis from an arbitrary number of input images.
Figure 3: 3D reconstructions with one or more images. Qualitative visualizations with 1, 3, and 5 views for SAP3D on real images (left column) and instances from the synthetic GSO dataset (right column). Observe how the wings of the eagle, the spiky weapon of the green turtle, and the yellow bunny's bouquet of flowers, all become more detailed and accurate with more views.
Figure 4: F1-Score for 3D reconstruction per input set size.
Figure 5: SAP3D novel view qualitative results. We present results for 1, 3, and 5 input images. With more input images, SAP3D improves fidelity of generated 3D details.
...and 4 more figures
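
Figure 4 reports an F1-Score for 3D reconstruction. As a rough illustration of how such a score is commonly computed between a predicted and a ground-truth point set, here is a minimal sketch. The function name `f1_score`, the threshold `tau`, and the brute-force nearest-neighbor search are assumptions made for this example; the paper's actual evaluation protocol (point sampling, alignment, threshold) is not described on this page.

```python
import math


def f1_score(pred, gt, tau):
    """F1 between two 3D point sets at distance threshold tau.

    Hypothetical helper for illustration only; not the paper's code.
    pred, gt: lists of (x, y, z) tuples. tau: distance threshold.
    """
    if not pred or not gt:
        return 0.0

    def nearest(p, pts):
        # Brute-force distance from p to its closest point in pts.
        return min(math.dist(p, q) for q in pts)

    # Precision: fraction of predicted points close to the ground truth.
    precision = sum(nearest(p, gt) <= tau for p in pred) / len(pred)
    # Recall: fraction of ground-truth points close to the prediction.
    recall = sum(nearest(q, pred) <= tau for q in gt) / len(gt)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(f1_score(pred, gt, tau=0.1))
```

In this toy case one of the two predicted points lies within `tau` of the ground truth and one of the two ground-truth points is covered, so precision and recall are both one half. The brute-force search is quadratic in the number of points; a spatial index would be the usual choice for dense point clouds.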
