TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation

Yufei Liu, Junwei Zhu, Junshu Tang, Shijie Zhang, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yunsheng Wu, Dongjin Huang

TL;DR

TexDreamer tackles the challenge of high-fidelity 3D human texture generation under semantic UV layouts by enabling zero-shot multimodal inputs. It introduces a two-branch framework: Text-to-UV (T2UV) achieved via efficient texture-adaptation finetuning of a diffusion model and Image-to-UV (I2UV) through a feature translator that maps image features into the T2UV conditioning space. The ATLAS dataset provides 50k high-resolution textures paired with descriptions to support training and evaluation in the UV space. Across qualitative and quantitative evaluations, TexDreamer outperforms prior methods in texture quality and text consistency, enabling rapid texture generation for dressed avatars and virtual try-on, while acknowledging ethical considerations and alignment limitations with real-world data.

Abstract

Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation finetuning strategy, we adapt a large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024 × 1024) 3D human texture dataset, which contains 50k high-fidelity textures with text descriptions.

Paper Structure (19 sections, 4 equations, 17 figures, 6 tables)

Figures (17)

  • Figure 1: Left: Overview of the ATLAS dataset. ATLAS is so far the largest high-resolution ($1,024\times1,024$) 3D human texture dataset paired with textual descriptions, including both real and fictional identities. Right: Basic structure of our TexDreamer, the first zero-shot high-fidelity human texture generation method that supports both text and image inputs.
  • Figure 2: Pipeline for generating synthetic data. Left: Sample texture acquisition. We first use a differentiable renderer to optimize UV from multi-view images, then further refine them by projection painting. Acquired sample textures with prompts are used to train T2UV in TexDreamer. Right: Diverse textured human synthesis. With the help of ChatGPT, we utilize T2UV to generate 50k human textures. Human images are rendered with an animation sequence, background image, HDR lighting, and a perspective camera. Orange stars indicate data included in our ATLAS dataset.
  • Figure 3: Structure of TexDreamer. We conduct two training stages. For T2UV (green), we use the LDM denoising loss $\mathcal{L}_1$ to optimize the text encoder and U-Net. For I2UV (blue), the feature translator $\phi_{i2t}$ maps the input image feature encoded by $\phi_{i-enc}$ to a conditional feature $f_{i2t}$. We train I2UV by optimizing $\phi_{t-enc}$ and $\phi_{i-enc}$ with $\mathcal{L}_2$.
  • Figure 4: Comparison of attention maps between the original SD and TexDreamer T2UV. The response area of the original SD is random, while T2UV consistently maps the prompts to the learned UV structure.
  • Figure 5: Qualitative comparison of texture generation from text. We compare TexDreamer with state-of-the-art texture generation methods, including Text2Tex, TEXTure, Latent-Paint, and Fantasia3D. Our results clearly achieve the finest facial details and the highest overall quality. Please zoom in for a better view.
  • ...and 12 more figures
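
For reference, the "LDM denoise loss" $\mathcal{L}_1$ named in the Figure 3 caption can be read against the standard latent diffusion objective. The form below is a sketch of that standard objective, not the paper's own equation, which is not reproduced here. Only $\mathcal{L}_1$ and $\phi_{t-enc}$ come from the caption; $z_0$ (the latent of a UV texture), $z_t$ (its noised version at timestep $t$), $c$ (the text prompt), and $\epsilon_\theta$ (the U-Net noise predictor) are assumed notation.

$$
\mathcal{L}_1 = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \phi_{t-enc}(c)\right) \right\|_2^2 \right]
$$

Under this reading, optimizing "the text encoder and U-Net" corresponds to updating the parameters of $\phi_{t-enc}$ and $\epsilon_\theta$ so that the predicted noise matches the noise added to the texture latent.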