LSS3D: Learnable Spatial Shifting for Consistent and High-Quality 3D Generation from Single-Image
Zhuojiang Cai, Yiheng Zhang, Meitong Guo, Mingdao Wang, Yuwang Wang
TL;DR
LSS3D addresses the challenge of producing high-quality, view-consistent 3D reconstructions from a single image by introducing learnable spatial shifting for each generated view, guided by the mesh being reconstructed. The method jointly optimizes 2D and 3D shifting maps with the mesh, starts from a coarse mesh initialization, and applies a texture-shifting projection to keep textures coherent across views. A robust input-view elevation estimation aligns the input perspective with the evolving geometry, improving robustness to elevated viewpoints. Evaluations on the Google Scanned Objects dataset show leading geometry and texture metrics. Ablations confirm the effectiveness of the shifting strategy and indicate that it is robust and could serve as a plug-and-play improvement for existing 3D generation pipelines.
Abstract
Recently, multi-view diffusion-based 3D generation methods have gained significant attention. However, these methods often suffer from shape and texture misalignment across the generated multi-view images, leading to low-quality 3D generation results, such as incomplete geometric details and textural ghosting. Moreover, some methods are mainly optimized for the frontal perspective and exhibit poor robustness to oblique input perspectives. In this paper, to tackle the above challenges, we propose a high-quality image-to-3D approach, named LSS3D, with learnable spatial shifting to explicitly and effectively handle multi-view inconsistencies and non-frontal input views. Specifically, we assign learnable spatial shifting parameters to each view and adjust each view towards a spatially consistent target, guided by the reconstructed mesh, resulting in high-quality 3D generation with more complete geometric details and clean textures. In addition, we include the input view as an extra constraint for the optimization, further enhancing robustness to non-frontal input angles, especially elevated input viewpoints. We also provide a comprehensive quantitative evaluation pipeline that the community can use for performance comparisons. Extensive experiments demonstrate that our method consistently achieves leading results in both geometric and texture evaluation metrics across more flexible input viewpoints.
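
The idea of per-view learnable shifts pulled towards a shared target can be illustrated with a deliberately simplified toy. This is not the LSS3D implementation: it assumes each view is a set of 2D keypoints offset by one unknown translation, uses the mean of the shifted views as a stand-in for the mesh-guided target, and keeps one anchor view fixed to mimic the input-view constraint. The function name `align_views` and all parameters are hypothetical.

```python
import numpy as np

def align_views(views, anchor=0, lr=0.25, steps=500):
    """Toy per-view shift optimization (illustrative assumption, not LSS3D).

    views:  array of shape (V, N, 2), N corresponding 2D points in V views.
    anchor: index of the view kept fixed (stand-in for the input view).
    Returns an array of shape (V, 2) with one learned 2D shift per view.
    """
    views = np.asarray(views, dtype=float)
    shifts = np.zeros((views.shape[0], 2))  # one learnable shift per view
    for _ in range(steps):
        shifted = views + shifts[:, None, :]
        # Consensus target: a stand-in for rendering the reconstructed mesh.
        target = shifted.mean(axis=0)
        residual = shifted - target[None, :, :]
        # Gradient of each view's mean squared residual w.r.t. its shift,
        # treating the target as fixed during this step.
        grad = 2.0 * residual.mean(axis=1)
        grad[anchor] = 0.0  # the anchor view is never moved
        shifts -= lr * grad
    return shifts

# Four views of the same points, each misaligned by a different translation.
base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
offsets = np.array([[0.0, 0.0], [0.3, -0.1], [-0.2, 0.4], [0.1, 0.1]])
views = base[None, :, :] + offsets[:, None, :]

shifts = align_views(views)
aligned = views + shifts[:, None, :]
```

In this toy setting the learned shifts cancel each view's offset relative to the anchor, so all shifted views coincide with the anchor view. The method described above goes further by optimizing 2D and 3D shifting maps jointly with the mesh rather than a single translation per view.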
