HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, Yonghong Tian

TL;DR

HiFi-123 tackles the challenge of generating high-fidelity, multi-view-consistent 3D content from a single image. It introduces two core innovations: RGNV, a Reference-Guided Novel View Enhancement that uses depth-conditioned DDIM inversion and attention injection to transfer reference textures to novel views, and RGSD, a Reference-Guided State Distillation loss that guides refinement of image-to-3D pipelines by distilling intermediate states from the RGNV process. When integrated, RGNV serves as a plug-in to diffusion-based zero-shot view synthesis, while RGSD improves the texture and color realism in the refined 3D representations, achieving state-of-the-art performance on both zero-shot novel-view synthesis and image-to-3D generation. The approach is validated through quantitative metrics and qualitative comparisons, with practical implications for accessible, high-quality 3D asset creation, albeit with limitations related to reliance on an initial coarse view and potential 3D ambiguities from a single reference image.
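
As a rough illustration of the DDIM inversion mentioned above, the following Python sketch inverts a toy latent toward noise and then samples it back. It is not the authors' implementation: `toy_eps`, the `alphas` schedule, and the list-based latents are assumptions made for illustration. The stand-in noise predictor depends only on the depth values, which makes the round trip exact; a real depth-conditioned denoising network also depends on the latent and the timestep, so its reconstruction would only be approximate.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative alpha levels."""
    out = []
    for xi, ei in zip(x, eps):
        # Predict the clean value implied by the current latent and noise estimate.
        x0_pred = (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        # Re-noise the prediction to the target level with the same noise estimate.
        out.append(math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * ei)
    return out


def toy_eps(x, depth):
    """Hypothetical noise predictor: depends on the depth condition only."""
    return [0.5 * d for d in depth]


def ddim_invert(x0, depth, alphas, eps_fn=toy_eps):
    """Walk from the clean latent toward noise; alphas must decrease."""
    trajectory = [list(x0)]
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = trajectory[-1]
        trajectory.append(ddim_step(x, eps_fn(x, depth), a_from, a_to))
    return trajectory


def ddim_sample(x_noisy, depth, alphas, eps_fn=toy_eps):
    """Walk back from the noisiest latent to the clean one."""
    x = list(x_noisy)
    reversed_alphas = alphas[::-1]
    for a_from, a_to in zip(reversed_alphas[:-1], reversed_alphas[1:]):
        x = ddim_step(x, eps_fn(x, depth), a_from, a_to)
    return x


alphas = [0.999, 0.8, 0.5, 0.2, 0.05]  # assumed cumulative alpha schedule
latent = [0.3, -1.2, 0.7]              # toy "image" latent
depth = [0.1, 0.5, 0.9]                # toy depth condition

states = ddim_invert(latent, depth, alphas)
restored = ddim_sample(states[-1], depth, alphas)
print(max(abs(a - b) for a, b in zip(latent, restored)))
```

`ddim_invert` returns every intermediate latent rather than only the final one, since the TL;DR describes RGSD as working with intermediate states. How those states enter the loss is not specified here, so the sketch does not model it.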

Abstract

Recent advances in diffusion models have enabled 3D generation from a single image. However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods. Second, capitalizing on the RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively. Video results are available on the project page.

Paper Structure

This paper contains 28 sections, 5 equations, 17 figures, 6 tables, 1 algorithm.

Figures (17)

  • Figure 1: HiFi-123 is capable of generating high-fidelity 3D content from a single reference image. In each block above, we display the reference image (top left corner) along with the rendered novel views and normal maps of the generated 3D content. The presented novel views demonstrate that our approach maintains consistency with, and high fidelity to, the reference image, even in views that deviate significantly from the reference view.
  • Figure 1: Comparison between depth-based DDIM inversion, regular DDIM inversion, and optimization-based Null-text inversion (Mokady et al., 2023). Example images are partly from Mokady et al. (2023).
  • Figure 2: Illustration of the RGNV pipeline. It performs depth-based DDIM inversion and sampling on both the reference image and the coarse novel view, and uses attention injection to transfer detailed textures from the reference image to the coarse novel view.
  • Figure 2: Robustness to the depth condition.
  • Figure 3: Image-to-3D generation pipeline. We use two stages to generate high-fidelity 3D content. In the coarse stage, we optimize an Instant-NGP representation using SDS loss, reference view reconstruction loss, depth loss, and normal loss. In the refinement stage, we export a DMTet representation and supervise training with our proposed RGSD loss.
  • ...and 12 more figures
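
The attention injection named in the RGNV pipeline caption can be pictured with a small Python sketch. It shows one common form of the idea, in which queries from the novel-view branch attend to keys and values taken from the reference branch. The captions do not say which variant the paper uses, so `inject_reference_attention`, the token shapes, and all values below are assumptions for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # Each output token is a weighted average of the value vectors.
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def inject_reference_attention(q_novel, k_ref, v_ref):
    """Assumed injection variant: novel-view queries read reference keys/values."""
    return attention(q_novel, k_ref, v_ref)


# Toy tokens: 2 novel-view queries, 3 reference tokens, feature size 2.
novel_q = [[4.0, 0.0], [0.0, 4.0]]
novel_k = [[1.0, 0.0], [0.0, 1.0]]
novel_v = [[0.0, 0.0], [0.5, 0.5]]
ref_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_v = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]

plain = attention(novel_q, novel_k, novel_v)                  # self-attention only
injected = inject_reference_attention(novel_q, ref_k, ref_v)  # reference-guided
print(plain)
print(injected)
```

Because every injected output token is a weighted average of reference value vectors, the novel view can only draw its features from the reference in this variant, which is the intuition behind transferring textures from one branch to the other.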