TEXGen: a Generative Diffusion Model for Mesh Textures

Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan Qi

TL;DR

This work trains a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner and proposes a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds.

Abstract

While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700-million-parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. The project page is at http://cvmi-lab.github.io/TEXGen/.
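
The interleaving of UV-space convolutions with point-cloud attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`uv_conv3x3`, `point_attention`, `hybrid_block`), the masked box filter standing in for a learned convolution, and the projection-free attention are all assumptions made for illustration. The sketch assumes a UV feature map `tex`, a boolean validity mask `mask`, and a per-texel 3D position map `pos_map`.

```python
import numpy as np

def uv_conv3x3(tex, mask):
    """Masked 3x3 box filter on an (H, W, C) UV map.

    A stand-in for a learned 2D convolution (assumption): it mixes each
    valid texel with its valid neighbors in UV space.
    """
    H, W, _ = tex.shape
    padded = np.pad(tex * mask[..., None], ((1, 1), (1, 1), (0, 0)))
    weight = np.pad(mask.astype(tex.dtype), 1)
    acc = np.zeros_like(tex)
    norm = np.zeros((H, W), dtype=tex.dtype)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + H, dx:dx + W]
            norm += weight[dy:dy + H, dx:dx + W]
    out = acc / np.maximum(norm, 1)[..., None]
    return out * mask[..., None]  # keep empty UV regions at zero

def point_attention(feats, pos):
    """Scaled dot-product self-attention over points, without learned projections.

    Queries and keys are the features concatenated with 3D positions
    (assumption), so texels that are close on the surface can attend to
    each other even if they are far apart on the UV map.
    """
    tokens = np.concatenate([feats, pos], axis=1)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats

def hybrid_block(tex, mask, pos_map):
    """One UV-space step followed by one 3D-space step."""
    tex = uv_conv3x3(tex, mask)                 # local mixing on the UV map
    out = np.zeros_like(tex)
    out[mask] = point_attention(tex[mask], pos_map[mask])  # mixing in 3D
    return out
```

Stacking several such blocks would alternate between the two neighborhoods: the convolution sees texels that are adjacent within a UV island, while the attention step relates texels through their 3D positions, independent of how the islands are laid out.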

Paper Structure

This paper contains 29 sections, 4 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: An illustration of (a) a mesh and (b) its UV map. Three islands $S_1$, $S_2$ and $S_3$ are shown both on the mesh surface and on its flattened UV map. The continuous islands $S_1$ and $S_2$ are positioned far apart on the UV map, while the disconnected islands $S_1$ and $S_3$ lie closer together on the UV map.
  • Figure 2: An overview of TEXGen. (a) An overview of our training pipeline. We train a diffusion model to generate high-resolution texture maps for a given mesh $S$ based on a single-view image $I$ and text descriptions by learning to denoise a noisy texture map $x_t$. The core of our denoising network is our proposed hybrid 2D-3D block. (b) The structure of a single hybrid block. (c)-(d) The detailed designs of our UV head block and point block.
  • Figure 3: An illustration of the feature learning procedure in 3D space. In panel (a), we start with rasterized dense point features, which we sparsify using grid-pooling to create sparse point features shown in (b). Different pools are indicated by various colors in (a). These points are then serialized to determine their order for subsequent group-based self-attention, as part of the learning process shown in (d). In (c), we visualize different groups formed based on Hilbert serialization, where each color signifies a distinct group. Finally, the processed features are scattered back to their original coordinates, providing the output dense point features.
  • Figure 4: Texture generation results. For given meshes, our method can synthesize highly detailed textures conditioned on single-view guidance images and text prompts. We show three novel-view images of our textured results and representative zoomed-in regions of the textured mesh. The generated full texture maps are also shown.
  • Figure 5: Comparison with state-of-the-art methods. We compare our method with four representative state-of-the-art methods. Our model synthesizes more detailed and coherent textures than these methods, which rely on test-time optimization using a pretrained 2D text-to-image diffusion model. Because our method is trained on a 3D dataset with a 3D representation, it also avoids the Janus problem that commonly occurs in other methods.
  • ...and 6 more figures
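
The 3D feature-learning procedure described in the Figure 3 caption (grid pooling, serialization, group-based self-attention, scattering back to the dense points) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: all function names, the cell size, and the group size are hypothetical; a Morton (Z-order) code is used as a simpler substitute for the Hilbert serialization named in the caption; and the attention has no learned projections.

```python
import numpy as np

def grid_pool(points, feats, cell):
    """Average the features of all points that fall into the same grid cell."""
    keys = np.floor(points / cell).astype(np.int64)
    cells, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # 1D regardless of NumPy version
    pooled = np.zeros((len(cells), feats.shape[1]), dtype=feats.dtype)
    np.add.at(pooled, inverse, feats)
    counts = np.bincount(inverse, minlength=len(cells)).astype(feats.dtype)
    return cells, pooled / counts[:, None], inverse

def morton_code(cells, bits=10):
    """Interleave the bits of integer cell coordinates (Z-order curve).

    Substitute for Hilbert serialization (assumption); requires each
    coordinate range to fit in `bits` bits.
    """
    cells = cells - cells.min(axis=0)  # shift to non-negative coordinates
    code = np.zeros(len(cells), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((cells[:, axis] >> b) & 1) << (3 * b + axis)
    return code

def group_self_attention(feats, order, group_size):
    """Self-attention restricted to consecutive groups along the serialized order."""
    out = np.empty_like(feats)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        x = feats[idx]
        scores = x @ x.T / np.sqrt(x.shape[1])
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[idx] = weights @ x
    return out

def point_branch(points, feats, cell=0.1, group_size=4):
    """Dense points -> sparse pooled points -> grouped attention -> dense points."""
    cells, pooled, inverse = grid_pool(points, feats, cell)
    order = np.argsort(morton_code(cells), kind="stable")
    attended = group_self_attention(pooled, order, group_size)
    return attended[inverse]  # scatter back to the original dense points
```

Restricting attention to fixed-size groups keeps the cost linear in the number of pooled points, and serializing along a space-filling curve tends to place spatially nearby cells in the same group.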