
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, Yueqi Duan

TL;DR

Sherpa3D addresses the challenge of high-fidelity, open-vocabulary text-to-3D generation under limited 3D data by leveraging a coarse 3D prior from a 3D diffusion model to guide 2D lifting. It introduces structural and semantic guidance derived from the prior, and employs an annealing schedule to balance 3D guidance with 2D diffusion capabilities, yielding rich, multi-view-consistent 3D assets. Extensive experiments demonstrate superior 3D coherence and visual fidelity compared with state-of-the-art baselines and confirm robustness across prompts. The approach offers a practical, efficient path to high-quality text-to-3D content suitable for real-time graphics pipelines and diverse applications.

Abstract

Recently, 3D content creation from text prompts has made remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure strong multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by limited 3D data. In contrast, 2D diffusion models support a distillation approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues: text prompts alone fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
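To make the two guidance terms more concrete, the following is a minimal sketch of how losses of this kind could be computed. It is not the authors' implementation. The function names, the cosine-distance form of the semantic term, and the forward-difference edge descriptor used for the structural term are all assumptions chosen for illustration. In practice the embeddings would come from a semantic encoder and the maps from renders of the coarse prior and the current shape.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_loss(prior_embedding, current_embedding):
    """Hypothetical semantic term: 1 - cosine similarity of two embeddings.

    The embeddings stand in for features of the coarse-prior render and the
    current render; zero means the two views agree semantically.
    """
    return 1.0 - cosine_similarity(prior_embedding, current_embedding)


def edge_descriptor(image):
    """Hypothetical structural descriptor: forward-difference gradient magnitude.

    `image` is a 2D list of floats (for example, one channel of a normal map).
    The result has one fewer row and column than the input.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            math.hypot(
                image[r][c + 1] - image[r][c],  # horizontal difference
                image[r + 1][c] - image[r][c],  # vertical difference
            )
            for c in range(cols - 1)
        ]
        for r in range(rows - 1)
    ]


def structural_loss(prior_map, current_map):
    """Hypothetical structural term: mean squared difference of edge descriptors."""
    d_prior = edge_descriptor(prior_map)
    d_current = edge_descriptor(current_map)
    diffs = [
        (p - q) ** 2
        for row_p, row_q in zip(d_prior, d_current)
        for p, q in zip(row_p, row_q)
    ]
    return sum(diffs) / len(diffs)
```

Both terms are zero when the current shape matches the coarse prior and grow as it drifts away, which is the behavior a guidance signal of this kind needs.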

Paper Structure (24 sections, 15 equations, 14 figures, 2 tables)


Figures (14)

  • Figure 1: Gallery of Sherpa3D: Blender renderings of various textured meshes from Sherpa3D, which generates high-fidelity, diverse, and multi-view-consistent 3D content from input text prompts. Our method is also compatible with popular graphics engines.
  • Figure 2: Pipeline of our Sherpa3D. Given a text as input, we first prompt a 3D diffusion model to build a coarse 3D prior $M$ encoded in the geometry model (e.g., DMTet). Next, we render the normal map of the mesh extracted from DMTet and derive two guiding strategies from $M$. (a) Structural Guidance: we utilize the structural descriptor to compute salient geometric features for preserving geometric fidelity (e.g., avoiding the pockmarked-face problem). (b) Semantic Guidance: we leverage a semantic encoder (e.g., CLIP) to extract high-level information for maintaining 3D consistency (e.g., avoiding multi-face issues). Employing the two types of guidance in the 2D lifting process, we use the normal map as the shape encoding for the 2D diffusion model and unleash its power to generate high-quality and diversified results with 3D coherence. We then obtain the final 3D results via photorealistic rendering through appearance modeling. ("Everest's summit eludes many without Sherpa.")
  • Figure 3: Qualitative comparisons with baseline methods across different views ($0^{\circ}$ and $180^\circ$). We can observe that baseline methods suffer from severe multi-face issues while our Sherpa3D can achieve better quality and 3D coherence.
  • Figure 4: Qualitative comparisons with baseline methods across different views ($-30^{\circ}$ and $150^\circ$).
  • Figure 5: Ablation study of our method. The generation is based on the text prompt "a head of the Terracotta Army". We ablate the design choices of structural guidance, semantic guidance (Sec. \ref{sec: 3D guidance}), and the step annealing technique (Sec. \ref{sec: optimization}).
  • ...and 9 more figures
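
The step annealing ablated in Figure 5 balances the 3D guidance against the 2D diffusion objective over the course of optimization. The sketch below illustrates one way such a schedule could look. The exponential decay, the parameter names `initial_weight` and `decay_rate`, and the additive form of the combined loss are assumptions for illustration only, since the schedule itself is not given in this summary.

```python
import math


def guidance_weight(step, total_steps, initial_weight=1.0, decay_rate=5.0):
    """Hypothetical annealed weight for the 3D guidance terms.

    Starts at `initial_weight` and decays exponentially with training
    progress, so the coarse prior dominates early and the 2D diffusion
    objective dominates late. The exponential form is an assumption.
    """
    progress = step / total_steps
    return initial_weight * math.exp(-decay_rate * progress)


def total_loss(lifting_loss, structural_loss, semantic_loss, step, total_steps):
    """Combine a 2D lifting loss with annealed structural and semantic terms."""
    weight = guidance_weight(step, total_steps)
    return lifting_loss + weight * (structural_loss + semantic_loss)
```

With these defaults the guidance weight starts at 1 and falls below 0.01 by the final step, so late iterations are driven almost entirely by the 2D lifting loss.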