LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image

Zhuojiang Cai, Yiheng Zhang, Meitong Guo, Mingdao Wang, Yuwang Wang

TL;DR

LSS3D addresses the challenge of producing high-quality, view-consistent 3D reconstructions from a single image by introducing learnable spatial shifting for each generated view, guided by the mesh under reconstruction. The method jointly optimizes 2D and 3D shifting maps with the mesh, uses a coarse mesh initialization, and applies a texture-shifting projection to keep textures coherent across views. An input-view elevation estimation step aligns the input perspective with the evolving geometry, improving robustness to elevated viewpoints. Evaluations on the Google Scanned Objects dataset show leading geometry and texture metrics. Ablations confirm the effectiveness of the shifting strategy, and the method can also be applied as a plug-and-play refinement for existing 3D generation pipelines.

Abstract

Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across the generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique input perspectives. In this paper, to tackle these challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle multi-view inconsistencies and non-frontal input views. Specifically, we assign learnable spatial shifting parameters to each view and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. In addition, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially elevated viewpoints. We also provide a comprehensive quantitative evaluation pipeline that can support performance comparisons in the community. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.
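
The idea of a learnable shift that pulls a view towards a consistent target can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a 1D signal instead of an image or normal map, a single scalar shift instead of a per-pixel shifting map, and a fixed target standing in for the mesh rendering. All names (`gaussian_profile`, `warp`, `fit_shift`) are hypothetical.

```python
import math

def gaussian_profile(n, center, sigma=4.0):
    """Toy 1D 'image': a smooth bump centered at `center`."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]

def warp(view, shift):
    """Resample `view` at positions i + shift with linear interpolation.

    Also returns d(warped[i]) / d(shift), which for linear interpolation
    is the difference between the two neighboring samples.
    """
    n = len(view)
    base = math.floor(shift)
    frac = shift - base
    warped, slopes = [], []
    for i in range(n):
        lo = min(max(i + base, 0), n - 1)      # clamp indices at the borders
        hi = min(max(i + base + 1, 0), n - 1)
        warped.append((1.0 - frac) * view[lo] + frac * view[hi])
        slopes.append(view[hi] - view[lo])
    return warped, slopes

def alignment_loss(view, target, shift):
    """Mean squared error between the shifted view and the target."""
    warped, _ = warp(view, shift)
    return sum((w - t) ** 2 for w, t in zip(warped, target)) / len(target)

def fit_shift(view, target, lr=50.0, steps=200):
    """Plain gradient descent on the scalar shift (assumed hyperparameters)."""
    shift = 0.0
    for _ in range(steps):
        warped, slopes = warp(view, shift)
        grad = sum(2.0 * (w - t) * s
                   for w, t, s in zip(warped, target, slopes)) / len(target)
        shift -= lr * grad
    return shift

# The target stands in for a rendering of the current mesh; the view is a
# generated image that is misaligned by 3 samples.
target = gaussian_profile(64, center=30.0)
view = gaussian_profile(64, center=33.0)
learned_shift = fit_shift(view, target)
```

In this toy setup the learned shift should end up close to the 3-sample offset between the two bumps. In the paper's setting, the shifting maps are per-view and are optimized jointly with the mesh, so the target itself changes during optimization.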

Paper Structure

This paper contains 11 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The results with the shift operation (2D shifting on texture and 3D shifting on normal maps) exhibit clearer textures and more accurate geometry. Both cases are from the GSO dataset [downs_google_2022].
  • Figure 2: Overview of LSS3D. Given an image of an object, our method first employs a multi-view diffusion model to generate six-view images and normal maps, which are then used to quickly reconstruct a coarse mesh. Next, the normal maps are used to iteratively optimize the mesh, with the normal maps passing through a learnable spatial shifting wrapper. Both the shifting maps and the mesh are optimized, which is completed quickly ($\sim$10 s). Finally, multi-view images, adjusted with a texture shifting wrapper, are projected onto the mesh.
  • Figure 3: Qualitative comparison. Our approach provides smooth and detailed geometry with clean texture. (Zoom in to see more details.)
  • Figure 4: Our method shows higher robustness to input views.
  • Figure 5: LSS3D plug-and-play refinement of results from other 3D generation methods (e.g., Trellis [xiang2024structured]).
  • ...and 1 more figure