Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, Xiaoguang Han
TL;DR
Hi3DGen generates high-fidelity 3D geometry from images by using normal maps as an intermediate representation, an approach the authors call normal bridging. It couples a Noise-Injected Regressive Normal Estimator (NiRNE), which predicts sharp, detailed normals from an image, with Normal-Regularized Latent Diffusion (NoRLD), which applies online normal supervision while learning to generate geometry. A large-scale synthetic dataset, DetailVerse, supports training and provides richer geometric detail and better generalization to real-world images. Extensive experiments and user studies show higher fidelity than state-of-the-art methods, supporting normal bridging as a practical route to realistic 3D content creation from RGB images.
Abstract
Despite the growing demand for high-fidelity 3D models from 2D images, existing methods still struggle to reproduce fine-grained geometric details accurately because of domain gaps and the inherent ambiguities of RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples low- and high-frequency image patterns through noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance the fidelity of 3D geometry generation; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.
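To make the idea of a normal map as an intermediate representation concrete, the following minimal sketch derives per-pixel unit normals from a small height grid and measures the mean angular error between two normal maps. It is an illustration only, not the paper's NiRNE or NoRLD: the function names `normals_from_height` and `mean_angular_error_deg`, the height-field input, and the finite-difference scheme are all assumptions chosen for brevity.

```python
import math


def normals_from_height(height, spacing=1.0):
    """Estimate unit surface normals for a height grid z = h(x, y).

    Hypothetical helper for illustration. Uses central differences in the
    interior and one-sided differences at the borders. Returns a grid of
    (nx, ny, nz) tuples with nz > 0.
    """
    rows, cols = len(height), len(height[0])
    normals = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Clamp neighbor indices so border pixels use one-sided differences.
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            dz_dx = (height[r][c1] - height[r][c0]) / ((c1 - c0) * spacing) if c1 > c0 else 0.0
            dz_dy = (height[r1][c] - height[r0][c]) / ((r1 - r0) * spacing) if r1 > r0 else 0.0
            # The surface z = h(x, y) has the (unnormalized) normal (-dh/dx, -dh/dy, 1).
            nx, ny, nz = -dz_dx, -dz_dy, 1.0
            length = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / length, ny / length, nz / length))
        normals.append(row)
    return normals


def mean_angular_error_deg(pred, target):
    """Mean angle in degrees between corresponding unit normals of two maps."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            dot = sum(a * b for a, b in zip(p, t))
            dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
            total += math.degrees(math.acos(dot))
            count += 1
    return total / count


# A flat plane versus a ramp with slope 1 along x.
flat = [[0.0] * 4 for _ in range(3)]
ramp = [[float(c) for c in range(4)] for _ in range(3)]
flat_normals = normals_from_height(flat)
ramp_normals = normals_from_height(ramp)
error = mean_angular_error_deg(ramp_normals, flat_normals)
```

Because normals encode local surface orientation directly, an angular comparison like this is sensitive to fine geometric detail that is hard to read from RGB values alone. In this toy setup the ramp's normals tilt 45 degrees away from the flat plane's, so `error` comes out at roughly 45.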
