ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J. Black, Yao Feng

TL;DR

ChatGarment turns images or text into sewing patterns for 3D garments by coupling a vision-language model (VLM) with a refined GarmentCode workflow: the VLM outputs a structured JSON file, which is then converted into sewing patterns. The paper introduces GarmentCodeRC to broaden garment coverage, along with a large automated data pipeline (≈20K garments, ≈1M images) that uses GPT-4o labeling to train multimodal reasoning and numeric decoding. A dedicated projection layer decodes numerical garment attributes from language tokens, enabling precise pattern generation and draping on SMPL-X bodies. Across Dress4D and CloSe, ChatGarment reports state-of-the-art reconstruction, editing, and text-guided generation, and it supports interactive multi-turn dialogue for design refinement. The authors release code and data and suggest the approach could reduce manual work in fashion and gaming pipelines.

Abstract

We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions. Unlike previous methods that struggle in real-world scenarios or lack interactive editing capabilities, ChatGarment can estimate sewing patterns from in-the-wild images or sketches, generate them from text descriptions, and edit garments based on user instructions, all within an interactive dialogue. These sewing patterns can then be draped on a 3D body and animated. This is achieved by fine-tuning a VLM to directly generate a JSON file that includes both textual descriptions of garment types and styles and continuous numerical attributes. This JSON file is then used to create sewing patterns through a programming parametric model. To support this, we refine the existing programming model, GarmentCode, by expanding its garment type coverage and simplifying its structure for efficient VLM fine-tuning. Additionally, we construct a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs through an automated data pipeline. Extensive evaluations demonstrate ChatGarment's ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to simplify workflows in fashion and gaming applications. Code and data are available at https://chatgarment.github.io/.
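
To make the idea of a JSON file that mixes textual garment attributes with continuous numerical ones more concrete, here is a minimal Python sketch. Every key and value in it (`garment_type`, `style`, `skirt_length`, and so on) is an invented placeholder and should not be read as the actual GarmentCode or ChatGarment schema. The helper `split_attributes` is likewise hypothetical. It only illustrates how such a configuration could be split into its textual and numeric parts before being passed to a parametric pattern generator.

```python
import json

# Hypothetical configuration: every key and value here is an illustrative
# assumption, not the real GarmentCode / ChatGarment schema.
raw = """
{
  "garment_type": "dress",
  "style": {"collar": "v-neck", "sleeve": "short"},
  "numeric": {"skirt_length": 0.62, "waist_width": 0.35}
}
"""


def split_attributes(config_text):
    """Separate textual (categorical) fields from continuous numeric ones."""
    config = json.loads(config_text)
    textual, numeric = {}, {}

    def walk(prefix, node):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                # Recurse into nested groups, building a dotted path.
                walk(path, value)
            elif isinstance(value, bool) or not isinstance(value, (int, float)):
                # Strings (and booleans) are treated as textual attributes.
                textual[path] = value
            else:
                numeric[path] = float(value)

    walk("", config)
    return textual, numeric


textual, numeric = split_attributes(raw)
```

Keeping the two groups separate mirrors the distinction the abstract draws between descriptive text and continuous values. How the real system represents and decodes the numeric part is not specified here.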

Paper Structure

This paper contains 27 sections, 2 equations, 14 figures, 15 tables.

Figures (14)

  • Figure 1: As a multimodal 3D garment creator, ChatGarment understands both images and language. It can estimate complex 3D garments represented as sewing patterns from a single image, which can be easily animated and simulated. It also supports garment editing based on text instructions.
  • Figure 2: Pipeline of ChatGarment. ChatGarment takes text or an image as input and generates a JSON file. The JSON file is decoded into 2D sewing patterns using GarmentCode and then draped onto the human body. The final 3D garments are compatible with simulation software (e.g., Maya, Blender, Style3D).
  • Figure 3: GarmentCodeRC. Left: new options to model open-front jackets, high-waist skirts, and tight pant legs. Right: simplified JSON configuration for more efficient LLM training.
  • Figure 4: Data Construction Pipeline. We generate garments from JSON configurations, simulate them with ContourCraft, and render them with Blender.
  • Figure 5: ChatGarment's dialog modes. Images and text are adaptively combined to guide garment generation and editing. The text output between STARTS and ENDS contains information about the JSON configuration, which is then converted to 3D garments (JSON2Garment) for visualization and simulation.
  • ...and 9 more figures
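
The Figure 5 caption mentions that the JSON-related text sits between STARTS and ENDS in the model's reply. As a rough sketch of how that span might be pulled out of a dialogue response, the Python snippet below treats the two words as literal plain-text markers. The exact spelling of the markers, the helper name `extract_between`, and the sample reply are all assumptions made for illustration, not details taken from the released code.

```python
def extract_between(text, start="STARTS", end="ENDS"):
    """Return the text between the first `start` marker and the next `end`
    marker, or None if either marker is missing.

    The marker strings are assumptions based on the figure caption.
    """
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop].strip()


# Invented example reply; the JSON content is a placeholder.
reply = (
    "Here is the garment. STARTS "
    '{"garment_type": "skirt"} '
    "ENDS Let me know if you want changes."
)
payload = extract_between(reply)
```

Returning `None` when a marker is missing lets the caller distinguish a reply that carries no configuration from one that carries an empty one.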