2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Yizheng Chen, Rengan Xie, Qi Ye, Sen Yang, Zixuan Xie, Tianxiao Chen, Rong Li, Yuchi Huo

TL;DR

This work tackles 3D reconstruction from imperfect generated multi-view (MV) images with a plug-in framework that combines intrinsic decomposition guidance, per-frame transient monocular priors, and a view augmentation fusion strategy. The method optimizes a neural SDF-based geometry and a textured appearance in two stages: geometry and albedo recovery, followed by texture reconstruction. Its key contributions are (1) intrinsic decomposition to remove inconsistent shading cues, (2) a mono normal prior with per-frame encoding to stabilize geometry across views, and (3) pixel-level losses on the generated sparse views together with semantic losses on augmented random views, which lets the framework work with multiple MV generators. Compared with SyncDreamer, the authors report a reduction in Chamfer Distance of about 36% and an improvement in PSNR of about 30%.

Abstract

Reconstructing 3D objects from a single image is an intriguing but challenging problem. One promising solution is to use multi-view (MV) 3D reconstruction to fuse generated MV images into consistent 3D objects. However, the generated images usually suffer from inconsistent lighting, misaligned geometry, and sparse views, leading to poor reconstruction quality. We present a novel 3D reconstruction framework that addresses these three issues with intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation, respectively. Specifically, we first leverage intrinsic decomposition to decouple the shading information from the generated images and reduce the impact of inconsistent lighting; then, we introduce a mono prior with view-dependent transient encoding to enhance the reconstructed normals; and finally, we design a view augmentation fusion strategy that minimizes a pixel-level loss in the generated sparse views and a semantic loss in augmented random views, resulting in view-consistent geometry and detailed textures. Our approach therefore enables the integration of a pre-trained MV image generator and a neural network-based volumetric signed distance function (SDF) representation for single-image-to-3D object reconstruction. We evaluate our framework on various datasets and demonstrate superior performance in both quantitative and qualitative assessments. Compared with the latest state-of-the-art method SyncDreamer [liu2023syncdreamer], we reduce the Chamfer Distance error by about 36% and improve PSNR by about 30%.
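
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Chamfer Distance and PSNR are commonly computed, and of how a relative change such as "about 36%" is obtained. It is not the authors' evaluation code: the function names, the use of unsquared mean nearest-neighbour distances, and the flat-list image format are all assumptions made for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two 3D point sets.

    Each set is a list of (x, y, z) tuples. This variant averages the
    Euclidean nearest-neighbour distance in each direction and adds the
    two averages. Other conventions (squared distances, sums instead of
    means) exist; which one the paper uses is not stated here.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized images.

    Images are flat lists of floats in [0, max_value] (an assumed format).
    """
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change(baseline, ours):
    """Signed relative change of `ours` with respect to `baseline`."""
    return (ours - baseline) / baseline
```

Under this convention, lower Chamfer Distance and higher PSNR are better, so a Chamfer Distance that falls to 0.64 times the baseline value gives a relative change of -0.36, which is the kind of quantity a statement like "reduced by about 36%" describes.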

Paper Structure (24 sections, 11 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Geometry misalignment and lighting inconsistency are common in state-of-the-art MV generation models, such as GAN-based [chan2022efficient] and diffusion-based [liu2023syncdreamer] generators.
  • Figure 2: Our pipeline for 3D mesh reconstruction from generated multi-view images. Off-the-shelf models for 2D image generation, intrinsic decomposition, and monocular depth estimation are used to generate sparse multi-view images together with their normal and albedo maps, which supervise the reconstruction stages. Reconstruction is split into two stages to produce view-consistent 3D results. Stage 1: reconstruct the geometry and albedo field under the guidance of the normal and albedo maps. Stage 2: reconstruct the shaded texture with highlight and shadow details. In addition, per-frame encoding and a view augmentation fusion scheme are designed to enhance view consistency and alleviate the under-supervision caused by sparse views.
  • Figure 3: The red line in view 1 corresponds to a misaligned boundary in view 2, which might produce a wrong contour on the surface in view 2, as shown by the orange line. However, the mono normal prior of view 2 enforces a smoothness constraint on the same region (the green line) and thus eliminates the wrong contour in the final reconstructed geometry.
  • Figure 4: Visual comparison of reconstructions from the baseline and from our framework on text-generated images [shi2023mvdream]. Our method extends to multi-view images produced by various generators from different kinds of input.
  • Figure 5: Visual comparison without and with intrinsic decomposition guidance in stage one.
  • ...and 8 more figures
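
The view augmentation fusion idea mentioned in the abstract and in the Figure 2 caption (a pixel-level loss on the generated sparse views plus a semantic loss on augmented random views) can be sketched roughly as follows. This is an illustrative assumption, not the paper's formulation: the names `pixel_loss`, `semantic_loss`, `fusion_loss`, and `semantic_weight` are hypothetical, mean squared error and cosine distance are stand-ins for whatever losses the authors use, and the feature vectors are assumed to come from some pre-trained image encoder that is not specified here.

```python
import math


def pixel_loss(rendered, target):
    """Mean squared error between two flat lists of pixel values."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)


def semantic_loss(features_a, features_b):
    """One minus cosine similarity between two non-zero feature vectors.

    The vectors stand in for embeddings of a rendered augmented view and
    of a reference image (hypothetical; the encoder is not specified).
    """
    dot = sum(a * b for a, b in zip(features_a, features_b))
    norm_a = math.sqrt(sum(a * a for a in features_a))
    norm_b = math.sqrt(sum(b * b for b in features_b))
    return 1.0 - dot / (norm_a * norm_b)


def fusion_loss(sparse_pairs, augmented_pairs, semantic_weight=0.1):
    """Combine both supervision signals into a single scalar.

    sparse_pairs:    list of (rendered_pixels, generated_pixels), one per
                     generated sparse view, supervised at pixel level.
    augmented_pairs: list of (rendered_features, reference_features), one
                     per randomly sampled augmented view, supervised only
                     at the semantic level.
    semantic_weight: assumed balancing factor (hypothetical value).
    """
    pixel_term = sum(pixel_loss(r, t) for r, t in sparse_pairs) / len(sparse_pairs)
    semantic_term = sum(
        semantic_loss(f, g) for f, g in augmented_pairs
    ) / len(augmented_pairs)
    return pixel_term + semantic_weight * semantic_term
```

The design point the sketch tries to convey is that augmented views have no generated ground-truth image, so they can only be constrained through a feature-space comparison, while the few generated views can still be matched pixel by pixel.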