Generating Surface for Text-to-3D using 2D Gaussian Splatting

Huanning Dong, Fan Li, Ping Kuang, Jianwen Min

TL;DR

This work introduces DirectGaussian, a surfel-based Text-to-3D framework that renders object surfaces using 2D Gaussian splatting guided by text-conditioned multi-view priors. It constructs a Gaussian surfel dataset (TextGaussian) and learns coarse surfels via a multi-head attention model, then refines them through a four-view optimization that enforces texture/normal consistency and global geometric coherence with curvature and convergence constraints. The method achieves diverse, high-fidelity textured 3D surfaces and shows robustness to unseen viewpoints, outperforming several Gaussian-splatting baselines in qualitative quality and user preference, while maintaining efficient rendering. By integrating 360° surface curvature constraints with diffusion-based priors, DirectGaussian enables practical text-to-3D content creation suitable for animation, VR, and game pipelines.

Abstract

Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train a model directly on a specific 3D representation. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. DirectGaussian uses text-conditioned generative models, and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. To address multi-view geometric inconsistency, DirectGaussian imposes curvature constraints on the generated surface during the optimization process. Through extensive experiments, we demonstrate that our framework is capable of diverse and high-fidelity 3D content creation.
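To make the surfel representation concrete, the following is a minimal sketch of a single 2D Gaussian surfel in the parameterization commonly used for 2D Gaussian splatting: a center, two orthonormal tangent directions, and two scales, with the surface normal taken as the cross product of the tangents. The `Surfel` class and its field names are hypothetical and chosen for illustration; the source does not specify how DirectGaussian stores its surfels.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v: Vec3) -> Vec3:
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return (v[0] / length, v[1] / length, v[2] / length)


@dataclass
class Surfel:
    """Hypothetical container for one 2D Gaussian surfel (illustrative only)."""
    center: Vec3  # position of the disk center in world space
    t_u: Vec3     # first tangent direction (assumed unit length)
    t_v: Vec3     # second tangent direction (assumed unit, orthogonal to t_u)
    s_u: float    # scale along t_u
    s_v: float    # scale along t_v

    def normal(self) -> Vec3:
        # The surfel is flat, so its normal is the cross product of its tangents.
        return normalize(cross(self.t_u, self.t_v))

    def point(self, u: float, v: float) -> Vec3:
        # Map local tangent-plane coordinates (u, v) to a world-space point.
        return tuple(
            c + self.s_u * u * a + self.s_v * v * b
            for c, a, b in zip(self.center, self.t_u, self.t_v)
        )

    def weight(self, u: float, v: float) -> float:
        # Unnormalized Gaussian falloff evaluated in local (u, v) coordinates.
        return math.exp(-0.5 * (u * u + v * v))


surfel = Surfel(center=(0.0, 0.0, 0.0), t_u=(1.0, 0.0, 0.0),
                t_v=(0.0, 1.0, 0.0), s_u=2.0, s_v=0.5)
print(surfel.normal())
print(surfel.point(1.0, 1.0))
```

Because each surfel carries an explicit normal, constraints on normals and curvature such as those described in the abstract can be expressed directly on the primitives; how the paper formulates them is not shown here.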

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: DirectGaussian generates 3D objects from a natural language caption such as "A pumpkin with a green stem and jack-o'-lantern features". The coarse Gaussian is produced by a multi-head attention model, whose hidden representations incorporate normal and texture latents extracted from two multi-view diffusion models, both conditionally guided by text. DirectGaussian refines the coarse Gaussians by minimizing splatting loss, as well as surface convergence and curvature constraints that enforce geometric consistency across 360-degree perspectives.
  • Figure 2: Gallery of text-to-3D generation results from DirectGaussian. Given text prompts as input descriptions, our method outputs high-quality textured 3D objects within minutes. The text prompts are selected from prior text-to-3D works.
  • Figure 3: Caption: A floating house supported by four stone pillars above a dirt and stone block base, with a fenced grassy rooftop and a water feature at the bottom.
  • Figure 5: Comparison with three other methods based on Gaussian Splatting. Zoom in for details.
  • Figure 6: The caption is "A beautiful cyborg with brown hair". Our learned coarse Gaussians lead to more stable convergence and better geometric fidelity (right). Random initialization often results in fragmented or inconsistent geometry (left).
  • ...and 1 more figure