TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
Yufei Liu, Junwei Zhu, Junshu Tang, Shijie Zhang, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yunsheng Wu, Dongjin Huang
TL;DR
TexDreamer generates high-fidelity 3D human textures in a semantic UV layout from zero-shot text or image inputs. It uses a two-branch framework: Text-to-UV (T2UV), obtained by efficient texture-adaptation finetuning of a diffusion model, and Image-to-UV (I2UV), which uses a feature translator to map image features into the T2UV conditioning space. The ATLAS dataset provides 50k high-resolution textures paired with text descriptions to support training and evaluation in UV space. In qualitative and quantitative evaluations, TexDreamer outperforms prior methods in texture quality and text consistency, enabling rapid texture generation for dressed avatars and virtual try-on. The authors also acknowledge ethical considerations and limitations in aligning with real-world data.
Abstract
Texturing 3D humans with semantic UV maps remains a challenge due to the difficulty of acquiring reasonably unfolded UV. Despite recent text-to-3D advancements in supervising multi-view renderings using large text-to-image (T2I) models, issues persist with generation speed, text consistency, and texture quality, resulting in data scarcity among existing datasets. We present TexDreamer, the first zero-shot multimodal high-fidelity 3D human texture generation model. Utilizing an efficient texture adaptation finetuning strategy, we adapt a large T2I model to a semantic UV structure while preserving its original generalization capability. Leveraging a novel feature translator module, the trained model is capable of generating high-fidelity 3D human textures from either text or an image within seconds. Furthermore, we introduce ArTicuLated humAn textureS (ATLAS), the largest high-resolution (1024 × 1024) 3D human texture dataset, which contains 50k high-fidelity textures with text descriptions.
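
The two input paths described above (text directly into the UV generator's conditioning space, and image features mapped into that same space by a feature translator) can be illustrated with a minimal data-flow sketch. Everything here is an assumption made for illustration: the class `TwoBranchTextureModel`, the toy encoders, the single linear map standing in for the translator, and the vector sizes are hypothetical and are not the authors' architecture or code. The sketch only shows how one generator could serve both modalities once they share a conditioning space.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Apply a dense linear map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


@dataclass
class TwoBranchTextureModel:
    """Hypothetical wrapper: both branches feed one shared UV generator."""

    text_encoder: Callable[[str], Vector]
    image_encoder: Callable[[Sequence[float]], Vector]
    translator_weights: Sequence[Sequence[float]]  # assumed: a single linear layer
    uv_generator: Callable[[Vector], Vector]

    def conditioning(
        self,
        text: Optional[str] = None,
        image: Optional[Sequence[float]] = None,
    ) -> Vector:
        """Return a vector in the text-conditioning space for either modality."""
        if (text is None) == (image is None):
            raise ValueError("provide exactly one of text or image")
        if text is not None:
            # Text branch: the encoder output is used as conditioning directly.
            return self.text_encoder(text)
        # Image branch: translate image features into the text-conditioning space.
        return linear(self.translator_weights, self.image_encoder(image))

    def generate(
        self,
        text: Optional[str] = None,
        image: Optional[Sequence[float]] = None,
    ) -> Vector:
        """Run the shared generator on whichever conditioning was produced."""
        return self.uv_generator(self.conditioning(text=text, image=image))


# Toy stand-ins (assumptions) so the sketch runs without any ML library.
def toy_text_encoder(prompt: str) -> Vector:
    # 2-d "conditioning": word count and character count.
    return [float(len(prompt.split())), float(len(prompt))]


def toy_image_encoder(pixels: Sequence[float]) -> Vector:
    # 3-d "image features": mean, min, and max intensity.
    return [sum(pixels) / len(pixels), min(pixels), max(pixels)]


def toy_uv_generator(cond: Vector) -> Vector:
    # Placeholder for the finetuned generator: scales the conditioning.
    return [2.0 * c for c in cond]


model = TwoBranchTextureModel(
    text_encoder=toy_text_encoder,
    image_encoder=toy_image_encoder,
    translator_weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],  # maps 3-d to 2-d
    uv_generator=toy_uv_generator,
)

texture_from_text = model.generate(text="a red jacket")
texture_from_image = model.generate(image=[0.0, 0.5, 1.0])
```

In this toy setup the translator's only job is to make image features the same shape as text conditioning, so that the generator never needs to know which modality it received. A real translator would be a learned module rather than a fixed matrix.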
