DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion

Yuanze Lin; Ronald Clark; Philip Torr

TL;DR

DreamPolisher tackles the challenge of producing high-fidelity, view-consistent 3D assets from text prompts by marrying 3D Gaussian Splatting with geometric diffusion guidance and a ControlNet-based appearance refiner. The method uses a two-stage pipeline: a coarse Stage 1 initialized from a text-to-point diffusion prior to produce a geometry-robust 3D prior, followed by Stage 2 appearance refinement that conditions on camera information and scene coordinates to boost texture detail and cross-view coherence. A novel view-consistency mechanism via a Scene Coordinate Renderer and a view-consistency loss further enforces multi-view alignment, while ISM-based optimization accelerates training relative to SDS-based approaches. Empirical results demonstrate improved visual quality and cross-view consistency over strong baselines like DreamGaussian, GaussianDreamer, and LucidDreamer, with appreciable efficiency (about 30 minutes per object on a single GPU). Overall, DreamPolisher narrows the quality gap between text-to-3D and text-to-image-to-3D methods while maintaining practical training efficiency.
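As a rough illustration of how a point cloud from a text-to-point prior could seed the Stage 1 Gaussians, the sketch below builds per-Gaussian parameters with NumPy. The function name, the parameter layout, the nearest-neighbour scale heuristic, and the default opacity are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def init_gaussians_from_points(points, colors, k=3, init_opacity=0.1):
    """Build per-Gaussian parameters from an (N, 3) point cloud.

    Hypothetical layout: one isotropic Gaussian per input point.
    """
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    n = points.shape[0]
    if n < 2:
        raise ValueError("need at least two points to estimate a scale")

    # Pairwise squared distances; O(N^2) memory, fine for a small sketch.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # ignore each point's distance to itself

    # Mean distance to the k nearest neighbours sets an isotropic scale.
    k = min(k, n - 1)
    knn = np.sqrt(np.sort(dist2, axis=1)[:, :k])
    scale = knn.mean(axis=1, keepdims=True)

    return {
        "means": points,                                      # (N, 3) centres
        "scales": np.repeat(scale, 3, axis=1),                # (N, 3) per-axis scale
        "rotations": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),   # identity quaternions (w, x, y, z)
        "opacities": np.full((n, 1), init_opacity),           # (N, 1)
        "colors": colors,                                     # (N, 3) RGB
    }
```

In a real pipeline these arrays would become optimizable tensors that the diffusion-guided loss updates; here they only show what "initialized from a point cloud" can mean in practice.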

Abstract

We present DreamPolisher, a novel Gaussian Splatting-based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress on text-to-3D generation methods has been promising, prevailing methods often fail to ensure view consistency and textural richness. This problem becomes particularly noticeable for methods that work with text input alone. To address this, we propose a two-stage Gaussian Splatting-based approach that enforces geometric consistency among views. Initially, a coarse 3D generation undergoes refinement via geometric optimization. Subsequently, we use a ControlNet-driven refiner coupled with the geometric consistency term to improve both texture fidelity and overall consistency of the generated 3D asset. Empirical evaluations across diverse textual prompts spanning various object categories demonstrate the efficacy of DreamPolisher in generating consistent and realistic 3D objects that align closely with the semantics of the textual instructions.

Paper Structure (23 sections, 11 equations, 9 figures, 1 table)

Introduction
Related Works
Text-to-3D Generation
Diffusion Models
Differentiable 3D Representations
Preliminary
Diffusion Models
ControlNet
Interval Score Matching
Methodology
Stage 1: Coarse Optimization
Stage 2: Appearance Refinement
Camera Encoder.
ControlNet Refiner.
View-Consistent Geometric Guidance
...and 8 more sections

Figures (9)

Figure 1: Overview. Given the user-provided textual instructions, DreamPolisher can generate high-quality and view-consistent 3D objects.
Figure 2: Comparison with existing methods. The two predominant approaches for generating 3D objects from text are text-to-image-to-3D (e.g., DreamCraft3D [sun2023dreamcraft3d]) and text-to-3D (e.g., LucidDreamer [liang2023luciddreamer]). DreamCraft3D requires 3 hours and uses both a prompt and an input image to generate a 3D object. While LucidDreamer can generate a 3D object in 35 minutes, it still struggles with consistency problems. Our method produces high-quality and visually coherent 3D objects quickly.
Figure 3: Coarse Optimization (Stage 1). The text prompt is first fed into a pre-trained text-to-point diffusion model, e.g., Point-E [nichol2022point], to obtain the corresponding point cloud, which is used to initialize the 3D Gaussians. We then use 3D Gaussian Splatting to optimize the object Gaussians, guided by a pre-trained text-to-image diffusion model.
Figure 4: Appearance Refinement (Stage 2). We render multiple views from the 3D object optimized by the coarse stage, and feed them to the Scene Coordinate Renderer. The rendered scene coordinates are then used in the view consistency loss, which aims to ensure that nearby scene points have consistent colors. The geometric embeddings from the camera encoder and the rendered multiple views are then fed into the ControlNet Refiner to generate high-quality and view-consistent 3D assets.
Figure 5: Comparison of existing text-to-3D approaches based on Gaussian splatting. We compare the results with three state-of-the-art 3D Gaussian Splatting based approaches. The output generated by DreamPolisher achieves better consistency and has more intricate appearance details.
...and 4 more figures
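
The Figure 4 caption says the view consistency loss aims to give nearby scene points consistent colors. A minimal NumPy sketch of that idea follows, assuming two views whose pixels carry rendered world-space scene coordinates and RGB values. The function name, the nearest-neighbour matching, the `radius` threshold, and the squared-error penalty are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def view_consistency_loss(coords_a, colors_a, coords_b, colors_b, radius=0.05):
    """Mean squared color difference between nearby scene points of two views.

    coords_a, coords_b: (N, 3) and (M, 3) rendered scene coordinates.
    colors_a, colors_b: (N, 3) and (M, 3) RGB values for the same pixels.
    radius: hypothetical distance threshold that defines "nearby".
    """
    coords_a = np.asarray(coords_a, dtype=np.float64)
    coords_b = np.asarray(coords_b, dtype=np.float64)
    colors_a = np.asarray(colors_a, dtype=np.float64)
    colors_b = np.asarray(colors_b, dtype=np.float64)

    # Distance from every point in view A to every point in view B.
    dist = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)

    # For each point in A, pick its closest point in B.
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(coords_a)), nearest]

    # Only pairs closer than the threshold count as the same surface point.
    mask = nearest_dist < radius
    if not mask.any():
        return 0.0

    color_diff = colors_a[mask] - colors_b[nearest[mask]]
    return float(np.mean(np.sum(color_diff ** 2, axis=-1)))
```

For training, the same computation would need to live in a differentiable framework so gradients can flow back to the rendered colors; the NumPy version only shows the matching and penalty logic.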
