
UpFusion: Novel View Diffusion from Unposed Sparse View Observations

Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani

TL;DR

UpFusion tackles novel-view synthesis and 3D reconstruction from a sparse set of images without camera poses. It uses an UpSRT-based scene transformer to produce query-view aligned features that condition a diffusion model, and augments this conditioning with direct attention to the input patches, enabling high-fidelity 2D views. To ensure 3D consistency, it then distills a 3D representation, optimized with Instant-NGP, by maximizing likelihood under the learned diffusion distribution via Score Distillation Sampling. Evaluations on Co3Dv2 and Google Scanned Objects show that UpFusion outperforms pose-dependent sparse-view methods and single-view baselines, and that it generalizes to unseen categories and in-the-wild images. The approach offers a practical path to in-the-wild sparse-view 3D inference: it reduces the reliance on accurate pose estimation while leveraging strong diffusion priors for both 2D rendering and 3D reconstruction.
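
For reference, the Score Distillation Sampling step mentioned above is commonly written as the gradient below. This is the generic form from the SDS literature, not an equation reproduced from this paper. Conditioning the noise predictor on ${\bm{c}}_s$ and ${\bm{c}}_d$, and the symbols $g_{\theta}$, $\pi$, $w(t)$, $\alpha_t$, and $\sigma_t$, are assumptions made here for illustration.

$$
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{t,\, {\bm{\epsilon}},\, \pi}\!\left[\, w(t)\, \big( {\bm{\epsilon}}_{\phi}({\bm{x}}_t;\, t,\, {\bm{c}}_s,\, {\bm{c}}_d) - {\bm{\epsilon}} \big)\, \frac{\partial {\bm{x}}}{\partial \theta} \right],
\qquad
{\bm{x}} = g_{\theta}(\pi), \quad
{\bm{x}}_t = \alpha_t\, {\bm{x}} + \sigma_t\, {\bm{\epsilon}}
$$

Here $\theta$ denotes the parameters of the 3D representation, $g_{\theta}(\pi)$ its rendering from a sampled query camera $\pi$, and ${\bm{\epsilon}}_{\phi}$ the noise predicted by the frozen diffusion model. The gradient pushes renderings from every sampled view toward images the conditional diffusion model considers likely.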

Abstract

We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from input views, but are not robust in-the-wild when such information is unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by learning to implicitly leverage the available images as context in a conditional generative model for synthesizing novel views. We incorporate two complementary forms of conditioning into diffusion models for leveraging the input views: a) via inferring query-view aligned features using a scene-level transformer, b) via intermediate attentional layers that can directly observe the input image tokens. We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google Scanned Objects datasets and demonstrate the benefits of our method over pose-reliant sparse-view methods as well as single-view methods that cannot leverage additional views. Finally, we also show that our learned model can generalize beyond the training categories and even allow reconstruction from self-captured images of generic objects in-the-wild.

Paper Structure (31 sections, 3 equations, 13 figures, 4 tables)


Figures (13)

  • Figure 1: 3D Inference from Unposed Sparse views. Given a sparse set of input images without associated camera poses, our proposed system UpFusion allows recovering a 3D representation and synthesizing novel views. Images with black border are 1, 3, or 6 unposed input views of an object. Images with green border are novel views synthesized using our approach.
  • Figure 2: UpSRT [srt22] performs novel view synthesis from a set of unposed images. UpSRT consists of an encoder, a decoder, and an MLP. The encoder takes encoded image features as inputs and outputs a set-latent representation ${\bm{c}}_s$. The decoder takes query rays as inputs and attends to the set-latent representation to get features ${\bm{c}}_d$, which are then fed into the MLP to obtain the final novel view RGB images. We make use of both ${\bm{c}}_s$ and ${\bm{c}}_d$ to provide conditional context to our model.
  • Figure 3: UpFusion 2D is the proposed conditional diffusion model, which performs novel view synthesis conditioned on information extracted from a set of unposed images. To reason about the query view, UpFusion takes as additional inputs the view-aligned decoder features ${\bm{c}}_d$ obtained from the UpSRT decoder. To further allow the model to attend to details from the input views, UpFusion conditions on the set-latent representation ${\bm{c}}_s$ via attentional layers.
  • Figure 4: Qualitative comparison with sparse-view baselines. We compare UpFusion with baseline methods using 3 and 6 unposed images as inputs. SparseFusion fails to capture the correct geometry because the camera poses estimated by RelPose++ are imperfect. UpSRT generates blurry results, as is typical of regression-based methods. In contrast, UpFusion 2D synthesizes sharp outputs with correct object poses. UpFusion 3D further improves the 3D consistency.
  • Figure 5: Generalization beyond training categories. We show results for UpFusion (3D) across object categories not seen in training. For each instance, we present the 1, 3, or 6 unposed input views (left), as well as 4 novel view renderings (right). We observe that despite not being trained on these categories, UpFusion is able to accurately infer the underlying 3D structure and generate detailed novel views.
  • ...and 8 more figures
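
The two conditioning paths described in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all sizes, the single-head attention without learned projections, the random stand-in tensors, and the channel-wise concatenation of ${\bm{c}}_d$ with the noisy input are hypothetical choices. The captions do not specify how the features are injected into the diffusion model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention.

    Learned query/key/value projections are omitted for brevity
    (an assumption of this sketch, not a claim about UpSRT).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (num_q, num_ctx)
    return weights @ context, weights


rng = np.random.default_rng(0)

# Hypothetical sizes: 3 unposed views, 16 patch tokens each, 32-dim features,
# and an 8x8 grid of query rays for the target view.
num_views, patches_per_view, dim = 3, 16, 32
height = width = 8

# Stand-in for the set-latent representation c_s (one token per input patch).
c_s = rng.normal(size=(num_views * patches_per_view, dim))

# Stand-in for the query-ray embeddings of the target view.
ray_emb = rng.normal(size=(height * width, dim))

# Path (a): query rays attend to c_s, giving view-aligned features c_d.
c_d, _ = cross_attention(ray_emb, c_s)

# Assumed injection: concatenate c_d with the noisy input per pixel.
x_t = rng.normal(size=(height * width, 4))
denoiser_input = np.concatenate([x_t, c_d], axis=-1)

# Path (b): intermediate denoiser features attend directly to c_s.
hidden = rng.normal(size=(height * width, dim))
attended, attn = cross_attention(hidden, c_s)

# Attention treats c_s as a set, so shuffling the input tokens leaves c_d
# unchanged; no view ordering or camera pose is needed.
perm = rng.permutation(c_s.shape[0])
c_d_shuffled, _ = cross_attention(ray_emb, c_s[perm])
```

Because the query-view features are obtained purely by attending over an unordered token set, the sketch shows why no pose is required for the input views. Only the query rays carry viewpoint information.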