
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Slava Elizarov, Ciara Rowles, Simon Donné

TL;DR

We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that represents 3D shapes efficiently as 2D geometry images, thereby avoiding the need for complex 3D-aware architectures.

Abstract

Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that represents 3D shapes efficiently as 2D geometry images, thereby avoiding the need for complex 3D-aware architectures. By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data (allowing us to use only high-quality training data) and retains compatibility with guidance techniques such as IPAdapter. In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.
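To make the representation concrete: a geometry image stores one 3D surface point per pixel, so a mesh can be recovered by connecting neighboring pixels. The following is a minimal sketch of that general idea, not the authors' implementation. The function name `geometry_image_to_mesh`, the `(H, W, 3)` XYZ layout, and the boolean occupancy `mask` are assumptions made for illustration.

```python
import numpy as np

def geometry_image_to_mesh(gim, mask):
    """Triangulate a geometry image (illustrative sketch).

    gim:  (H, W, 3) float array, one XYZ surface point per pixel (assumed layout).
    mask: (H, W) bool array, True where a pixel belongs to a chart (assumed input).
    Returns (vertices, faces) with vertices (N, 3) and faces (M, 3).
    """
    h, w, _ = gim.shape
    # Give every occupied pixel a vertex index; empty pixels stay at -1.
    index = np.full((h, w), -1, dtype=np.int64)
    index[mask] = np.arange(int(mask.sum()))
    vertices = gim[mask]

    # The four corners of every 2x2 pixel block.
    a = index[:-1, :-1]  # top-left
    b = index[:-1, 1:]   # top-right
    c = index[1:, :-1]   # bottom-left
    d = index[1:, 1:]    # bottom-right

    # Split each block into two triangles.
    tri1 = np.stack([a, b, c], axis=-1).reshape(-1, 3)
    tri2 = np.stack([b, d, c], axis=-1).reshape(-1, 3)
    faces = np.concatenate([tri1, tri2])

    # Drop triangles that touch an empty pixel.
    faces = faces[(faces >= 0).all(axis=1)]
    return vertices, faces
```

Because triangles are only formed between adjacent occupied pixels, separate charts in the image remain disconnected in the resulting mesh unless they are stitched in a later step.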

Paper Structure (21 sections, 9 figures)

This paper contains 21 sections, 9 figures.

Figures (9)

  • Figure 1: Meshes generated with our proposed Geometry Image Diffusion (GIMDiffusion) method. For each object, we show the generated albedo texture, the textured mesh, the untextured mesh, and the respective text prompt. The objects are generated entirely using our method: the structure, the texture, and the layout of the UV map are all generated from scratch.
  • Figure 2: (a) Ground-truth geometry, (b) geometry image and (c) albedo texture from our data pre-processing, and (d) the reconstruction using our dedicated VAE. We note the highly separable nature of the ground truth object, which is split into small components. The only visible artifact after decoding is the missing connection between the various charts of the geometry image, as discussed in the data-handling and limitations sections.
  • Figure 3: The Collaborative Control scheme [Boss2024CollaborativeCF] applied in GIMDiffusion, where two separate diffusion models generate albedo textures and geometry images, respectively. The former is a frozen pre-trained model, while the latter is an architectural clone trained from scratch.
  • Figure 4: Seam detection in our multi-chart geometry image creation procedure to isolate locally invertible areas of the UV mapping. (Left) If two neighboring mesh regions correspond to two distinct charts in the UV map, the vertices on the boundary will be duplicated and have different UV coordinates. (Right) If the UV mapping loops back onto itself, there will be a local minimum in the UV access heatmap, and we place the seam along the line with the smallest UV-degree to effectively separate these regions.
  • Figure 5: The resulting triangulation of our generated objects is near-uniform over the surface, thanks to the area-preserving nature of the geometry images in our training dataset.
  • ...and 4 more figures
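
The first seam case described in Figure 4 (boundary vertices that are duplicated with different UV coordinates) can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's procedure: the function name `find_seam_vertices`, the array layouts, and the rounding tolerance are hypothetical, and the second case (a UV mapping that loops back onto itself) is not covered.

```python
import numpy as np

def find_seam_vertices(positions, uvs, decimals=6):
    """Find vertices that share a 3D position but differ in UV (illustrative sketch).

    positions: (N, 3) float array of vertex positions (assumed layout).
    uvs:       (N, 2) float array of per-vertex UV coordinates (assumed layout).
    decimals:  rounding used to decide that two coordinates coincide (hypothetical tolerance).
    Returns the sorted indices of vertices lying on a chart boundary.
    """
    # Group vertices by (rounded) 3D position.
    keys = np.round(positions, decimals)
    _, group = np.unique(keys, axis=0, return_inverse=True)
    group = group.reshape(-1)

    seam = np.zeros(len(positions), dtype=bool)
    for g in np.unique(group):
        members = np.flatnonzero(group == g)
        if len(members) > 1:
            # Duplicated position: a seam only if the UVs disagree.
            uv = np.round(uvs[members], decimals)
            if len(np.unique(uv, axis=0)) > 1:
                seam[members] = True
    return np.flatnonzero(seam)
```

Vertices returned by this check mark where one chart ends and another begins, so edges between them can be treated as cuts when splitting the mesh into charts.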