SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis

Xinya Chen, Christopher Wewer, Jiahao Xie, Xinting Hu, Jan Eric Lenssen

TL;DR

SemanticNVS is a camera-conditioned multi-view diffusion model for novel view synthesis (NVS) that improves generation quality and consistency by integrating pre-trained semantic feature extractors.

Abstract

We present SemanticNVS, a camera-conditioned multi-view diffusion model for novel view synthesis (NVS) that improves generation quality and consistency by integrating pre-trained semantic feature extractors. Existing NVS methods perform well for views near the input view; however, they tend to generate semantically implausible and distorted images under long-range camera motion, revealing severe degradation. We speculate that this degradation arises because current models fail to fully understand their conditioning or intermediate generated scene content. We therefore propose to integrate pre-trained semantic feature extractors to incorporate stronger scene semantics as conditioning, achieving high-quality generation even at distant viewpoints. We investigate two strategies: (1) warped semantic features and (2) an alternating scheme of understanding and generation at each denoising step. Experimental results on multiple datasets demonstrate clear qualitative and quantitative (4.69%-15.26% in FID) improvements over state-of-the-art alternatives.
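
The alternating scheme in strategy (2) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `extract_features`, `complete_features`, `toy_denoiser`, and `sample` are hypothetical names, the "images" are short lists of numbers, the update rule is a simple interpolation rather than a real diffusion scheduler, and zero-filling the unseen regions before the first step is an assumption. It only shows the control flow: at every denoising step the model generates a clean estimate, features are re-extracted from that estimate, and they fill in the regions that the warped input-view features do not cover.

```python
import random


def extract_features(image):
    """Hypothetical stand-in for a pre-trained semantic extractor such as DINO.

    A real extractor returns patch embeddings; this toy maps each pixel p to 2*p.
    """
    return [2.0 * p for p in image]


def complete_features(warped, valid, generated):
    """Keep warped input-view features where valid; fill holes from `generated`."""
    return [w if v else g for w, v, g in zip(warped, valid, generated)]


def toy_denoiser(x_t, features):
    """Hypothetical stand-in for the diffusion network's clean-image estimate.

    Blends the noisy sample with the image implied by the features (f / 2 here,
    the inverse of the toy extractor above).
    """
    return [0.5 * x + 0.5 * (f / 2.0) for x, f in zip(x_t, features)]


def sample(warped, valid, num_steps=10, seed=0):
    rng = random.Random(seed)
    n = len(warped)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(n)]  # start from noise
    # Assumption: before any estimate exists, holes are zero-filled.
    features = complete_features(warped, valid, [0.0] * n)
    for step in range(num_steps):
        # Generation: estimate the clean target view from the current sample.
        x0_hat = toy_denoiser(x_t, features)
        # Understanding: re-extract features from the estimate and use them
        # to complete the warped features for the next iteration.
        features = complete_features(warped, valid, extract_features(x0_hat))
        # Toy update rule (not a real diffusion scheduler): move toward x0_hat.
        alpha = (step + 1) / num_steps
        x_t = [(1.0 - alpha) * x + alpha * x0 for x, x0 in zip(x_t, x0_hat)]
    return x_t, features


warped = [1.0, 0.0, -2.0, 0.0]      # features warped from the input view
valid = [True, False, True, False]  # False marks regions unseen in the input
image, features = sample(warped, valid)
```

In this sketch the warped features are never overwritten where they are valid, so the input view stays the anchor for the conditioning, while the unseen regions are refreshed from the model's own intermediate estimate at each step.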

Paper Structure (17 sections, 14 equations, 5 figures, 6 tables)

This paper contains 17 sections, 14 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Method Overview. SemanticNVS integrates semantic DINO features into the multi-view diffusion setup in two different ways. First, it provides warped features from the given input view and uses them as additional conditioning. Second, it extracts DINO features from intermediate generations $\hat{x}^t_0$ of the previous iteration and uses them to complete the warped DINO features.
  • Figure 2: Qualitative Comparison on RealEstate10K. ViewCrafter and Uni3C struggle to follow the long camera trajectory accurately. SEVA produces degraded views when moving far from the input. In contrast, our method better adheres to the target trajectory, generates more realistic novel views, and yields a more coherent underlying geometry when reconstructing the 3D scene from the generated frames. More results can be found in the appendix and at https://semanticnvs.github.io/.
  • Figure 3: Qualitative Comparison on Tanks-and-Temples. ViewCrafter, Uni3C, and SEVA fail to follow the long camera trajectory and produce unrealistic or degraded views. In contrast, our method better adheres to the target trajectory, generates more realistic novel views, and yields a more coherent underlying geometry when reconstructing from the generated frames.
  • Figure 4: Qualitative Ablation on RealEstate10K. Warped RGB fails to generate the floor-to-ceiling glass window on the right side of the input view. Adding Warped DINO produces a clear window, but the chair remains incomplete. Further guiding sampling with intermediate-sample features (Iterative DINO) enables view-consistent synthesis of the entire scene.
  • Figure A1: Qualitative Comparison on RealEstate10K.