HiFi-123: Towards High-fidelity One Image to 3D Content Generation
Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, Yonghong Tian
TL;DR
HiFi-123 addresses high-fidelity, multi-view-consistent 3D content generation from a single image. It introduces two components: Reference-Guided Novel View Enhancement (RGNV), which uses depth-conditioned DDIM inversion and attention injection to transfer reference textures to novel views, and a Reference-Guided State Distillation (RGSD) loss, which guides the refinement stage of image-to-3D pipelines by distilling intermediate states from the RGNV process. RGNV works as a plug-in for diffusion-based zero-shot novel view synthesis, while RGSD improves texture and color realism in the refined 3D representation. Together they achieve state-of-the-art performance on both zero-shot novel view synthesis and image-to-3D generation, as shown by quantitative metrics and qualitative comparisons. The method has practical value for accessible, high-quality 3D asset creation, but it relies on an initial coarse novel view and remains subject to the 3D ambiguities inherent in a single reference image.
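To make the attention injection idea above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head layout, and the random token features are assumptions for illustration only. It shows the general mechanism in which queries from the novel-view branch attend to keys and values taken from the reference-image branch, so each output token becomes a weighted mix of reference features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Single-head scaled dot-product attention over (tokens, dim) arrays.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.T) * scale, axis=-1)
    return weights @ v

def injected_attention(q_novel, k_ref, v_ref):
    # Hypothetical helper: the novel-view queries attend to reference
    # keys/values instead of the novel view's own keys/values.
    return self_attention(q_novel, k_ref, v_ref)

# Toy features standing in for self-attention inputs of the two branches
# (assumed shapes; real diffusion features would come from a U-Net layer).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
novel_feats = rng.normal(size=(n_tokens, dim))
ref_feats = rng.normal(size=(n_tokens, dim))

plain = self_attention(novel_feats, novel_feats, novel_feats)
injected = injected_attention(novel_feats, ref_feats, ref_feats)
```

Because the attention weights in each row are non-negative and sum to one, every row of `injected` is a convex combination of rows of `ref_feats`, which is the sense in which reference texture information is carried into the novel view in this toy setting.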
Abstract
Recent advances in diffusion models have enabled 3D generation from a single image. However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold. First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods. Second, building on RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss. When incorporated into an optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively. Video results are available on the project page.
