GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

TL;DR

GenWarp tackles the ill-posed problem of generating novel views from a single image by moving beyond explicit depth-based warping and inpainting. It introduces a two-stream diffusion framework that jointly learns a semantic-preserving representation of the source view and a target view through implicit warping, achieved via coordinate embeddings and augmented cross-view attention conditioned on depth signals $D_i$, camera pose $P_{i \rightarrow j}$, and intrinsics $K$. The method fine-tunes a pretrained diffusion model and CLIP-based image conditioning to generate semantically faithful, high-quality novel views, outperforming warping-based and attention-based baselines on both in-domain datasets (e.g., RealEstate10K, ScanNet) and out-of-domain images. This approach advances practical single-shot NVS by preserving input semantics, enabling wider applicability in real-world scenes and complex viewpoints.

Abstract

Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/.
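
The geometric warping step that the abstract attributes to prior warping-and-inpainting methods can be illustrated with a short sketch. This is not the authors' code: the function name `warp_coordinates` and the split of the relative pose into `R` and `t` are assumptions made for illustration, and a standard pinhole camera model is assumed. Given a source depth map, intrinsics $K$, and a source-to-target pose, each source pixel is back-projected to 3D, moved into the target frame, and re-projected.

```python
import numpy as np

def warp_coordinates(depth, K, R, t):
    """Map every source pixel to its location in the target view.

    depth: (H, W) source-view depth (e.g. from a monocular depth estimator)
    K:     (3, 3) pinhole intrinsics, assumed shared by both views
    R, t:  (3, 3) rotation and (3,) translation, source -> target (hypothetical split of the pose)
    Returns (H, W, 2) target pixel coordinates and (H, W) target-view depth.
    Assumes all points land in front of the target camera (positive depth).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                    # back-project: K^{-1} [u, v, 1]^T
    pts_src = rays * depth[..., None]                  # 3D points in the source camera frame
    pts_tgt = pts_src @ R.T + t                        # rigid transform into the target frame
    proj = pts_tgt @ K.T                               # project with the same intrinsics
    z = proj[..., 2]
    uv = proj[..., :2] / z[..., None]                  # perspective divide
    return uv, z
```

Because the target coordinates scale inversely with depth, any error in the estimated depth shifts where a pixel lands, which is one way to see why noisy depth maps distort explicitly warped images.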

Paper Structure (32 sections, 6 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Teaser. Our model generates plausible novel views conditioned on only a single input view, handling both in-domain images (top) and out-of-domain images (bottom).
  • Figure 2: Limitations of the explicit warping-and-inpainting approach [rombach2022high, chung2023luciddreamer, ouyang2023text2immersion]. At challenging new camera viewpoints, warping-and-inpainting results show artifacts. (a) The neon sign present in the input view is distorted after geometric warping because of noisy depth. (b) The next room, glimpsed from the new camera viewpoint, lacks the context given by the input view.
  • Figure 3: Method overview: (Left) Given an input view and a desired camera viewpoint, we obtain a pair of embeddings: a 2D coordinate embedding for the input view, and a warped coordinate embedding for the novel view from estimated depth through MDE. With these embeddings, a semantic preserver network produces a semantic feature of the input view, and a diffusion model conditioned on them learns to conduct geometric warping to generate novel views. (Right) We augment self-attention with cross-view attention, followed by aggregating the features with both attentions at once. It helps the model to consider where to generate and where to warp.
  • Figure 4: Visualization of augmented self-attention map. In augmented self-attention map $A$, the original self-attention part $A_\mathrm{self}$ is more attentive to regions requiring generative priors, such as occluded or ill-warped areas (top), while the concatenated cross-view attention part $A_\mathrm{cross}$ focuses on regions that can be reliably warped from the input view (bottom). By aggregating both attentions at once, the model naturally determines which regions to generate and which to warp.
  • Figure 5: Qualitative results with images in the wild. We compare our method with Stable Diffusion Inpainting [rombach2022high] on in-the-wild images. More qualitative results can be found in Fig. \ref{fig:additional_qual_wild} of the Appendix.
  • ...and 7 more figures
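
The augmented self-attention described in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: the function name, argument names, and the use of plain scaled dot-product attention without learned projections are assumptions. Keys and values from the target (novel) view and the source (input) view are concatenated, so one softmax spans both and yields the two parts $A_\mathrm{self}$ and $A_\mathrm{cross}$ of the attention map $A$.

```python
import numpy as np

def augmented_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src):
    """Joint attention over target-view and source-view tokens.

    q_tgt, k_tgt, v_tgt: (N_tgt, d) target-view queries, keys, values
    k_src, v_src:        (N_src, d) source-view keys and values
    Returns the aggregated features (N_tgt, d) and the two blocks of the
    attention map: a_self (N_tgt, N_tgt) and a_cross (N_tgt, N_src).
    """
    d = q_tgt.shape[-1]
    k = np.concatenate([k_tgt, k_src], axis=0)        # keys from both views
    v = np.concatenate([v_tgt, v_src], axis=0)        # values from both views
    logits = q_tgt @ k.T / np.sqrt(d)                 # scaled dot-product scores
    logits -= logits.max(axis=-1, keepdims=True)      # subtract row max for stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)    # one softmax across both views
    n_tgt = k_tgt.shape[0]
    a_self, a_cross = weights[:, :n_tgt], weights[:, n_tgt:]
    return weights @ v, a_self, a_cross
```

Since both blocks share a single normalization, each query's attention mass is split between the target view itself and the source view. Under the assumptions above, that split is what would let a query lean on `a_cross` where warping from the input is reliable and on `a_self` where content must be generated.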