OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis

Jinglin Liang, Zijian Zhou, Rui Huang, Shuangping Huang, Yichen Gong

Abstract

Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views of unobserved regions, particularly from a single input view, and still have difficulty maintaining geometric and appearance consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. We then target two key properties of 3D objects, geometry and appearance. For geometry, we design a normal map generation branch and use its normal map features to guide the synthesis of the target views via an attention mechanism, thereby improving geometric consistency. For appearance, we apply pixel-space supervision to alleviate the blur caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
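
To give a sense of scale for the PSNR gains quoted above, the following is a minimal sketch of the standard PSNR definition, not code from the paper. The function names `psnr` and `mse_ratio_for_gain` are hypothetical, and pixel values are assumed to lie in [0, 1]. For a single image with a fixed peak value, a gain of g dB corresponds to the mean squared error shrinking by a factor of 10^(-g/10). Benchmark numbers are averages over many images, so this is only a rough per-image reading.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    Assumes pixel values lie in [0, max_value]; a zero error gives infinity.
    """
    if mse <= 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    for gain in (2.9, 2.4):
        ratio = mse_ratio_for_gain(gain)
        print(f"+{gain} dB PSNR -> MSE multiplied by {ratio:.3f}")
```

Under these assumptions, +2.9 dB corresponds to roughly halving the per-image squared error.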

Paper Structure (18 sections, 6 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: OrbitNVS leverages the rich visual prior of video generation models to achieve highly consistent NVS with precise camera control. This strong prior lets the model infer plausible content for unseen views. For instance, it can infer the detailed front view of a robot from only its back, or deduce the presence of windows on the opposite side of a house from a single frontal image.
  • Figure 2: Overview of OrbitNVS. (a) Model architecture. The model incorporates camera conditions through camera adapters inserted before the first DiT block and into each subsequent DiT block. A normal map generation branch, which shares parameters with the primary branch, is added to guide the RGB video synthesis. Other conditions are injected via two pathways: 1) channel-wise concatenation of the reference frame's latent representation and a positional mask with the noised latent, and 2) integration of text embeddings and the first frame's CLIP embedding via cross-attention at each layer. (b) Loss computation. The latent representations are decoded back to pixel space using the VAE decoder to compute a pixel-space loss, while the original latent loss is retained.
  • Figure 3: Qualitative comparison of NVS results generated by different methods under the single reference view setting.
  • Figure 4: Ablation on the normal map generation branch. Results with the branch (c, d) show significantly clearer geometric details than those without it (b).
  • Figure 5: Ablation on the pixel-space post-training. The result with the pixel loss (d) shows superior plausibility in detailed textures compared to the result without it (b).
  • ...and 5 more figures
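
The loss computation described in the Figure 2(b) caption can be sketched as follows. This is an illustrative toy, not the authors' implementation: the squared-error form of both terms, the `pixel_weight` parameter, and the `toy_decode` stand-in for the VAE decoder are all assumptions introduced here. The caption states only that a pixel-space loss is computed on decoded latents and that the original latent loss is retained.

```python
from typing import Callable, List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(
    pred_latent: Sequence[float],
    target_latent: Sequence[float],
    target_pixels: Sequence[float],
    decode: Callable[[Sequence[float]], List[float]],
    pixel_weight: float = 1.0,  # assumed weighting, not given in the source
) -> float:
    """Latent-space loss plus a pixel-space loss on the decoded prediction."""
    latent_term = mse(pred_latent, target_latent)
    # Decode the predicted latent so the second term is measured in pixel space.
    pixel_term = mse(decode(pred_latent), target_pixels)
    return latent_term + pixel_weight * pixel_term


def toy_decode(latent: Sequence[float]) -> List[float]:
    """Hypothetical decoder: repeats each latent value twice (2x upsampling)."""
    pixels: List[float] = []
    for value in latent:
        pixels.extend([value, value])
    return pixels


if __name__ == "__main__":
    loss = combined_loss([0.0], [1.0], [1.0, 1.0], toy_decode, pixel_weight=0.5)
    print(loss)
```

In a real pipeline the decoder would be the pre-trained VAE decoder and the inputs would be video tensors. The toy version only shows how the two terms combine.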