AIpparel: A Multimodal Foundation Model for Digital Garments

Kiyohiro Nakayama, Jan Ackermann, Timur Levent Kesdogan, Yang Zheng, Maria Korosteleva, Olga Sorkine-Hornung, Leonidas J. Guibas, Guandao Yang, Gordon Wetzstein

TL;DR

AIpparel introduces a multimodal foundation model for digital garments by fine-tuning a large multimodal model on the GCD-MM dataset and employing a compact sewing-pattern tokenizer to encode complex patchwork geometry. The approach enables accurate image-to-pattern prediction, text-conditioned generation, and language-driven editing, outperforming state-of-the-art single-modal baselines and enabling novel multimodal garment workflows. Key contributions include the GarmentCodeData-MultiModal dataset, a lightweight yet expressive tokenization scheme, and regression heads for continuous pattern parameters, all culminating in simulation-ready sewing patterns. This work advances AI-assisted fashion design by translating web-scale vision-language knowledge into actionable garment generation and editing, with potential impacts in design efficiency and fabrication while acknowledging dataset biases and societal considerations.

Abstract

Apparel is essential to human life, offering protection, mirroring cultural identities, and showcasing personal style. Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a multimodal foundation model for generating and editing sewing patterns. Our model fine-tunes state-of-the-art large multimodal models (LMMs) on a custom-curated large-scale dataset of over 120,000 unique garments, each with multimodal annotations including text, images, and sewing patterns. Additionally, we propose a novel tokenization scheme that concisely encodes these complex sewing patterns so that LLMs can learn to predict them efficiently. AIpparel achieves state-of-the-art performance in single-modal tasks, including text-to-garment and image-to-garment prediction, and enables novel multimodal garment generation applications such as interactive garment editing. The project website is at https://georgenakayama.github.io/AIpparel/.

Paper Structure

This paper contains 54 sections, 3 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: AIpparel. We present a multimodal foundation model for digital garments trained by fine-tuning a large multimodal model on a custom sewing pattern dataset using a novel tokenization scheme for these patterns. AIpparel generates complex, diverse, high-quality sewing patterns based on multimodal inputs, such as text and images, and it unlocks new applications such as language-instructed sewing pattern editing. The generated sewing patterns can be directly used to simulate the corresponding 3D garments.
  • Figure 2: Illustration of Our Method. AIpparel uses a novel sewing pattern tokenizer (light blue region) to tokenize each panel into a set of special tokens (light green region). Panel vertex positions and 3D transformations are incorporated using positional embeddings (colored arrows) to the tokens. AIpparel takes in multimodal inputs, such as images and texts (light orange region), to output sewing patterns using autoregressive sampling (light grey region). Finally, the output is decoded to produce simulation-ready sewing patterns (light pink region). See the method section for details.
  • Figure 3: Image-to-Garment Prediction (Qualitative). GCD-MM (Left): our model can reconstruct suitable sewing patterns from the input image alone. In contrast, SewFormer does not produce simulation-ready sewing patterns despite fine-tuning. SewFactory (Right): SewFormer produces inaccurate panels (top row) and incorrect garment types (bottom row) while AIpparel accurately recovers sewing patterns from the images, resulting in superior simulation results. See the image-to-garment prediction section.
  • Figure 4: Multimodal Sewing Pattern Prediction (Qualitative). AIpparel predicts sewing patterns that follow the inputs more closely than the baselines. See the multimodal sewing pattern prediction section.
  • Figure 5: Sewing Pattern Editing (Qualitative). Our model follows the editing instructions more accurately than the baseline, correctly adding a hood to the tank top (top row) and elongating the skirt (bottom row). See the sewing pattern editing section.
  • ...and 10 more figures
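
The Figure 2 caption describes panels being turned into special tokens, with vertex positions and 3D transformations attached to those tokens as continuous values. The sketch below is a rough illustration of that idea only, not the paper's tokenizer. Every name in it (`Edge`, `Panel`, `tokenize`, `detokenize`, and token strings such as `<panel>` and `<line>`) is hypothetical, the 3D transformation is simplified to a translation, and the continuous values are kept in a plain list aligned with the tokens instead of being fed through positional embeddings or predicted by regression heads.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical special tokens; the paper's actual vocabulary is not given here.
GARMENT_START, GARMENT_END = "<garment>", "</garment>"
PANEL_START, PANEL_END = "<panel>", "</panel>"
EDGE_TOKENS = {"line": "<line>", "curve": "<curve>"}


@dataclass
class Edge:
    kind: str                        # "line" or "curve" (assumed edge types)
    end_vertex: Tuple[float, float]  # 2D endpoint in panel coordinates


@dataclass
class Panel:
    edges: List[Edge]
    translation: Tuple[float, float, float]  # simplified 3D placement


def tokenize(panels: List[Panel]):
    """Return (tokens, params): discrete tokens plus aligned continuous values."""
    tokens, params = [GARMENT_START], [()]
    for panel in panels:
        # The panel-start token carries the panel's 3D placement.
        tokens.append(PANEL_START)
        params.append(panel.translation)
        for edge in panel.edges:
            # Each edge token carries the 2D endpoint of that edge.
            tokens.append(EDGE_TOKENS[edge.kind])
            params.append(edge.end_vertex)
        tokens.append(PANEL_END)
        params.append(())
    tokens.append(GARMENT_END)
    params.append(())
    return tokens, params


def detokenize(tokens, params) -> List[Panel]:
    """Invert tokenize(): rebuild panels from tokens and their aligned values."""
    kind_of = {tok: kind for kind, tok in EDGE_TOKENS.items()}
    panels, current = [], None
    for tok, value in zip(tokens, params):
        if tok == PANEL_START:
            current = Panel(edges=[], translation=value)
        elif tok in kind_of:
            if current is None:
                raise ValueError("edge token outside of a panel")
            current.edges.append(Edge(kind_of[tok], value))
        elif tok == PANEL_END:
            if current is None:
                raise ValueError("panel end without a panel start")
            panels.append(current)
            current = None
    return panels
```

In this toy setup a four-edge panel becomes eight tokens in total (garment start, panel start, four edge tokens, panel end, garment end), and `detokenize(*tokenize(panels))` returns the original panels. Keeping the discrete structure in the token stream and the geometry in the aligned values is the design choice the caption points at: the sequence stays short because coordinates are not spelled out as text.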