Novel View Synthesis using DDIM Inversion

Sehajdeep Singh, A V Subramanyam, Aditya Gupta, Sahil Gupta

TL;DR

This work proposes a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion and uses the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model.

Abstract

Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled from the predicted latent can be blurry. To address this, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
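
The abstract relies on two standard operations: deterministic DDIM inversion of an image latent, and DDIM sampling from a noisy latent. The sketch below illustrates only these two operations and is not the authors' implementation. The linear beta schedule, the 50-step stride, the helper names (`make_alphas_cumprod`, `ddim_step`, `ddim_invert`, `ddim_sample`), and the constant stand-in noise predictor are all assumptions made for illustration; TUNet and the fusion strategy are not modeled. The stopping point $t=600$ follows the Figure 2 caption.

```python
import numpy as np

def make_alphas_cumprod(num_train_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; the paper's schedule is not given here.
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_step(x, eps, a_from, a_to):
    # Deterministic DDIM update (eta = 0) between two noise levels.
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    # Walk from low noise to high noise, reusing the noise predicted
    # at the current step (the usual DDIM inversion approximation).
    x = x0
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):
        eps = eps_model(x, t_from)
        x = ddim_step(x, eps, alphas_cumprod[t_from], alphas_cumprod[t_to])
    return x

def ddim_sample(x_t, eps_model, alphas_cumprod, timesteps):
    # Walk the same timesteps in reverse, from high noise to low noise.
    x = x_t
    rev = timesteps[::-1]
    for t_from, t_to in zip(rev[:-1], rev[1:]):
        eps = eps_model(x, t_from)
        x = ddim_step(x, eps, alphas_cumprod[t_from], alphas_cumprod[t_to])
    return x

# Toy usage: a random array stands in for a VAE latent, and a constant
# noise predictor (hypothetical) stands in for the pretrained diffusion U-Net.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
const_eps = rng.standard_normal((4, 8, 8))
eps_model = lambda x, t: const_eps

alphas_cumprod = make_alphas_cumprod()
timesteps = list(range(0, 601, 50))  # invert up to t = 600

z_inv = ddim_invert(latent, eps_model, alphas_cumprod, timesteps)
x_rec = ddim_sample(z_inv, eps_model, alphas_cumprod, timesteps)
```

With a constant noise predictor the round trip is exact up to floating-point error. With a real network, inversion is only approximate, because the noise is predicted at the current step rather than the next one. In the pipeline described above, the inverted latent would be translated to the target view and fused before the sampling call, rather than sampled directly.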

Paper Structure

This paper contains 27 sections, 13 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Qualitative comparison on RealEstate10K.
  • Figure 2: Overview: Given a single reference image $\mathbf{x}_{\text{ref}}$, we first apply DDIM inversion up to $t=600$ to obtain the mean latent $\mathbf{z}_{\text{ref},\mu}^{\text{inv}}$. This, together with camera intrinsics/extrinsics, class embeddings, and ray information, is fed into our translation network TUNet. TUNet predicts the target-view mean latent $\tilde{\mathbf{z}}_{\text{tar},\mu}^{\text{inv}}$, which we combine with the corresponding noise component via one of our fusion strategies to form the initial DDIM latent $\tilde{\mathbf{z}}_{\text{tar}}^{\text{inv}}$. Finally, this latent is sampled by a pretrained diffusion model to synthesize the novel view image.
  • Figure 2: Generating multiple frames with single input image from MVImgNet.
  • Figure 3: Mean of the DDIM-inverted latent at $t=400$, $600$, and $800$, decoded with the VAE for visualization. Original image: 512$\times$512. At $t=400$, the mean is dominated by low frequencies, which precludes generating diverse images. At $t=800$, the low-frequency component is extremely weak. $t=600$ provides a weak yet effective signal for translation.
  • Figure 3: Qualitative ablation results.
  • ...and 8 more figures