VI3DRM: Towards Meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis

Hao Chen, Jiafu Wu, Ying Jin, Jinlong Peng, Xiaofeng Mao, Mingmin Chi, Mufeng Yao, Bo Peng, Jian Li, Yun Cao

TL;DR

This paper introduces the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse-view 3D reconstruction model that operates within an ID-consistent and perspective-disentangled 3D latent space and generates highly realistic images that are indistinguishable from real photographs.

Abstract

Recent methods such as Zero-1-2-3 focus on single-view 3D reconstruction and have achieved remarkable success. However, their predictions for unseen areas rely heavily on the inductive bias of large-scale pretrained diffusion models. Subsequent work such as DreamComposer attempts to make predictions more controllable by incorporating additional views, but the results remain unrealistic because factors such as lighting, material, and structure are entangled in the vanilla latent space. To address these issues, we introduce the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse-view 3D reconstruction model that operates within an ID-consistent and perspective-disentangled 3D latent space. By facilitating the disentanglement of semantic information, color, material properties, and lighting, VI3DRM generates highly realistic images that are indistinguishable from real photographs. By leveraging both real and synthesized images, our approach enables the accurate construction of pointmaps, ultimately producing finely textured meshes or point clouds. On the novel view synthesis (NVS) task, evaluated on the GSO dataset, VI3DRM significantly outperforms the state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of 0.929, and an LPIPS of 0.027. Code will be made available upon publication.
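
For readers unfamiliar with the first of the reported metrics, the following is a minimal sketch of the standard PSNR definition, PSNR = 10 log10(MAX^2 / MSE). It is not the authors' evaluation code: the function names, the flat-list image representation, and the intensity range `max_value=1.0` are assumptions made for illustration, and the abstract does not say how per-image scores are aggregated.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel intensities in [0, max_value]
    (an assumed representation, chosen to keep the sketch dependency-free).
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the root-mean-square pixel error."""
    return max_value * 10.0 ** (-psnr_db / 20.0)
```

Under this definition, a PSNR of 38.61 dB corresponds to a root-mean-square pixel error of roughly 1.2% of the intensity range, which `rmse_from_psnr(38.61)` reproduces. SSIM and LPIPS measure structural and learned perceptual similarity respectively and are not covered by this sketch.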

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Latent Visualization. Twenty randomly sampled GSO objects are encoded in both the Vanilla Latent Space (left) and our ID-Consistent Latent Space (right). The original feature visualization (left) is scattered and disordered, whereas ours tightly clusters different views of the same object in the latent space.
  • Figure 2: Zero-shot NVS on the GSO dataset. Our method outperforms previous approaches by a large margin in both texture and structural accuracy.
  • Figure 3: Our method involves three primary steps. Step 1: We encode four known-view images into our ID-consistent Latent Space and extract semantic embeddings as a global condition. These conditions guide the LDM model to generate four novel views. Step 2: We feed both the original and synthesized images into Dust3r to construct optimized pointmaps. Step 3: Meshes or point clouds are extracted from the pointmaps.
  • Figure 4: ID-Consistent and Perspective-Disentangled 3D Latent Space Training: By optimizing with our proposed $L_{ID}$, our latent space effectively disentangles identity features from view-dependent characteristics.
  • Figure 5: From known-viewpoint conditions to the generation of novel perspectives.
  • ...and 2 more figures