NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Lingen Li, Zhaoyang Zhang, Yaowei Li, Jiale Xu, Wenbo Hu, Xiaoyu Li, Weihao Cheng, Jinwei Gu, Tianfan Xue, Ying Shan

TL;DR

NVComposer removes the need for external multi-view alignment in generative novel view synthesis (NVS) by introducing an image-pose dual-stream diffusion model and a geometry-aware feature alignment adapter. The model synthesizes novel views from sparse, unposed inputs by inferring relative pose relationships during generation and by distilling 3D geometric priors from dense stereo networks during training. Trained on a mixed dataset, it achieves state-of-the-art performance on real scenes and synthetic objects, and its results improve as the number of unposed inputs increases. Because it needs no explicit pose estimation or pre-reconstruction at inference, it offers a more flexible, robust, and accessible solution for generative NVS across scenes and objects.

Abstract

Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems. Our project page is available at https://lg-li.github.io/project/nvcomposer

Paper Structure

This paper contains 25 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: As the number of unposed input views increases, NVComposer (blue circle) effectively uses the extra information to improve NVS quality. In contrast, ViewCrafter yu2024viewcrafter (green triangle), which relies on external multi-view alignment (via pre-reconstruction from DUSt3R wang2024dust3r), suffers performance degradation as the number of views grows due to instability of the external alignment. This result contradicts the common expectation that "more views lead to better performance." Please refer to the evaluation results section for full results.
  • Figure 2: Framework illustration of NVComposer. It contains an image-pose dual-stream diffusion model that generates novel views while implicitly estimating camera poses for conditional images, and a geometry-aware feature alignment adapter that uses geometric priors distilled from pretrained dense stereo models wang2024dust3r.
  • Figure 3: Structure of the geometry-aware feature alignment adapter in NVComposer, which aligns the internal features of the dual-stream diffusion model with the 3D point maps produced by DUSt3R wang2024dust3r during training. Blocks labeled "$\times 2$", "$\times 4$", and "$\times 8$" denote bilinear upsampling along the spatial dimensions. The four red bars denote the channel-wise MLPs.
  • Figure 4: Visual comparison of NVS results on the RealEstate10K zhou2018stereo and DL3DV ling2024dl3dv test sets. MotionCtrl wang2024motionctrl and CameraCtrl he2024cameractrl use only the first view as input, while the other methods use two views. MotionCtrl and CameraCtrl produce incorrect camera trajectories. DUSt3R and ViewCrafter exhibit better camera control but introduce artifacts due to occlusions or misaligned multi-view inputs. Our model generates views that are visually closer to the reference. We provide zoomed-in details of the first three scenes in white boxes for a closer look. Additional visual comparisons can be found in the supplementary material.
  • Figure 5: Visual comparison of novel view generation results on the Objaverse deitke2023objaverse test set. All input views are unposed and randomly rendered from the same 3D object.
  • ...and 1 more figures
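
The adapter described in the Figure 3 caption (four channel-wise MLPs, bilinear upsampling by 2, 4, and 8, and alignment against DUSt3R point maps) can be illustrated with a small pure-Python sketch. Everything beyond those caption details is an assumption made for illustration: the function names, the two-layer ReLU MLP, applying the MLP before upsampling, averaging the four predictions, and the mean-squared-error loss are hypothetical and are not taken from the paper.

```python
import random


def bilinear_upsample(fmap, factor):
    """Corner-aligned bilinear upsampling of an H x W x C nested list."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    out_h, out_w = h * factor, w * factor
    out = []
    for i in range(out_h):
        # Map the output row back to a fractional source row.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1, ty = min(y0 + 1, h - 1), y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1, tx = min(x0 + 1, w - 1), x - x0
            row.append([
                (1 - ty) * ((1 - tx) * fmap[y0][x0][k] + tx * fmap[y0][x1][k])
                + ty * ((1 - tx) * fmap[y1][x0][k] + tx * fmap[y1][x1][k])
                for k in range(c)
            ])
        out.append(row)
    return out


def dense(vec, weights, bias):
    """One linear layer; weights is an (out x in) nested list."""
    return [sum(wi * v for wi, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def channel_mlp(fmap, w1, b1, w2, b2):
    """Apply the same two-layer ReLU MLP to every pixel's channel vector."""
    return [[dense([max(0.0, v) for v in dense(px, w1, b1)], w2, b2)
             for px in row] for row in fmap]


def alignment_loss(features, mlps, point_map):
    """MSE between averaged per-scale predictions and a 3D point map.

    Assumption: the four feature maps sit at 1, 1/2, 1/4, and 1/8 of the
    point-map resolution, so they are upsampled by 1, 2, 4, and 8.
    """
    factors = [1, 2, 4, 8]
    preds = [bilinear_upsample(channel_mlp(f, *p), s)
             for f, p, s in zip(features, mlps, factors)]
    h, w = len(point_map), len(point_map[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            for k in range(3):  # x, y, z coordinates of the point map
                mean_pred = sum(p[i][j][k] for p in preds) / len(preds)
                total += (mean_pred - point_map[i][j][k]) ** 2
    return total / (h * w * 3)


def random_mlp(rng, in_dim, hidden, out_dim=3):
    """Hypothetical initializer returning (w1, b1, w2, b2)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    return w1, [0.0] * hidden, w2, [0.0] * out_dim


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = [8, 4, 2, 1]  # feature resolutions matching factors 1, 2, 4, 8
    feats = [[[[rng.random() for _ in range(4)] for _ in range(s)]
              for _ in range(s)] for s in sizes]
    mlps = [random_mlp(rng, 4, 8) for _ in sizes]
    points = [[[rng.random() for _ in range(3)] for _ in range(8)]
              for _ in range(8)]
    print("alignment loss:", alignment_loss(feats, mlps, points))
```

In a real training setup the feature maps would come from the diffusion backbone and the point maps from a pretrained dense stereo model. A tensor library would replace the nested lists, but the data flow would stay the same.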