WordRobe: Text-Guided Generation of Textured 3D Garments

Astitva Srivastava, Pranav Manu, Amit Raj, Varun Jampani, Avinash Sharma

TL;DR

WordRobe is a text-guided framework for generating unposed, textured 3D garments. It learns a two-stage garment latent space through coarse-to-fine decoding of unsigned distance fields and aligns that space to CLIP with a weakly supervised mapping network. Texture synthesis then runs in a single feed-forward step, using ControlNet on view-composited depth inputs to produce fast, view-consistent textures. The authors report state-of-the-art performance in latent-space learning, garment interpolation, and texture synthesis, supported by a user study, and the framework also supports sketch- and image-guided generation and editing. The unposed meshes can be fed directly to standard cloth simulation and animation pipelines, with potential applications in virtual try-on, avatars, gaming, and AR/VR.
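
To make the unsigned distance field (UDF) representation concrete, here is a minimal Python sketch. It approximates the UDF of an open surface by the distance to the nearest sampled surface point. The function `unsigned_distance`, the `surface_points` list, and the flat-square example are illustrative assumptions, not taken from the paper, which decodes UDFs with a learned network rather than a nearest-point search.

```python
import math

def unsigned_distance(query, surface_points):
    """Approximate UDF: distance from `query` to the nearest sampled surface point."""
    return min(math.dist(query, p) for p in surface_points)

# Hypothetical open surface: a flat unit square in the z = 0 plane, sampled on a grid.
surface_points = [(x / 10, y / 10, 0.0) for x in range(11) for y in range(11)]

# Query the field on both sides of the sheet and on the sheet itself.
above = unsigned_distance((0.5, 0.5, 0.2), surface_points)
below = unsigned_distance((0.5, 0.5, -0.2), surface_points)
on_surface = unsigned_distance((0.5, 0.5, 0.0), surface_points)
```

The field takes the same value on either side of the sheet and is zero on it, so no inside/outside labeling is required. That property is what lets a UDF describe open surfaces such as garments with sleeve and hem openings.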

Abstract

In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent disentanglement, promoting better latent interpolation. Subsequently, we align the garment latent space to the CLIP embedding space in a weakly supervised manner, enabling text-driven 3D garment generation and editing. For appearance modeling, we leverage the zero-shot generation capability of ControlNet to synthesize view-consistent texture maps in a single feed-forward inference step, thereby drastically decreasing the generation time as compared to existing methods. We demonstrate superior performance over current SOTAs for learning 3D garment latent space, garment interpolation, and text-driven texture synthesis, supported by quantitative evaluation and qualitative user study. The unposed 3D garment meshes generated using WordRobe can be directly fed to standard cloth simulation & animation pipelines without any post-processing.
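
As a rough illustration of the latent interpolation mentioned above, the following Python sketch blends two latent codes linearly. The helper `lerp_latents`, the 4-dimensional codes, and the choice of linear blending are assumptions made for illustration; the abstract does not specify the latent dimension or the interpolation scheme, and the decoder that would turn each code into a garment is not shown.

```python
def lerp_latents(z_a, z_b, t):
    """Linearly blend two latent codes; t=0 returns z_a, t=1 returns z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Hypothetical 4-D codes standing in for two garments.
z_shirt = [0.0, 1.0, -2.0, 4.0]
z_dress = [2.0, 1.0, 2.0, 0.0]

# Five evenly spaced codes from the first garment to the second.
path = [lerp_latents(z_shirt, z_dress, i / 4) for i in range(5)]
```

Each entry of `path` would be passed to the garment decoder in a full pipeline. How smoothly the decoded shapes change along the path is the property the disentanglement loss is meant to improve.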

Paper Structure

This paper contains 24 sections, 5 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Text-guided generation and editing of 3D textured garments using WordRobe.
  • Figure 2: Overview of the proposed method for text-guided 3D garment generation.
  • Figure 3: The proposed coarse-to-fine training strategy for learning garment latent space.
  • Figure 4: Automated training data generation & weakly supervised training of $MLP_{map}$.
  • Figure 5: Text-driven manipulation of the latent code.
  • ...and 15 more figures