DiCTI: Diffusion-based Clothing Designer via Text-guided Input

Ajda Lampe, Julija Stopar, Deepak Kumar Jain, Shinichiro Omachi, Peter Peer, Vitomir Štruc

TL;DR

DiCTI tackles fast, text-guided garment design by reframing editing as inpainting with a diffusion model. It introduces a two-stage pipeline: a Mask Generation Module that uses DensePose to produce body and head masks, and a Garment Synthesis Module that performs latent-diffusion inpainting conditioned on text prompts, with an identity-preserving post-processing step. Evaluations on VITON-HD and Fashionpedia show DiCTI outperforms the state-of-the-art FICE in both image realism and prompt adherence, validated by quantitative metrics and a human study. The approach demonstrates robustness to unconstrained settings and supports diverse garment designs, offering a practical tool for designers and consumer-facing applications.

Abstract

Recent developments in deep generative models have opened up a wide range of opportunities for image synthesis, leading to significant changes in various creative fields, including the fashion industry. While numerous methods have been proposed to benefit buyers, particularly in virtual try-on applications, comparatively little attention has been paid to fast prototyping for designers and for customers seeking to order new designs. To address this gap, we introduce DiCTI (Diffusion-based Clothing Designer via Text-guided Input), a straightforward yet highly effective approach that allows designers to quickly visualize fashion-related ideas using text inputs only. Given an image of a person and a description of the desired garments as input, DiCTI automatically generates multiple high-resolution, photorealistic images that capture the expressed semantics. By leveraging a powerful diffusion-based inpainting model conditioned on text inputs, DiCTI is able to synthesize convincing, high-quality images with varied clothing designs that closely follow the provided text descriptions, while handling very diverse and challenging inputs captured in completely unconstrained settings. We evaluate DiCTI in comprehensive experiments on two datasets (VITON-HD and Fashionpedia) and in comparison to the state-of-the-art (SoTA). The results of our experiments show that DiCTI convincingly outperforms the SoTA competitor, generating higher-quality images with more elaborate garments and superior text prompt adherence, according to both standard quantitative evaluation measures and human ratings collected in a user study.

Paper Structure

This paper contains 17 sections, 8 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Example results generated by DiCTI. Given an initial image and a description of the desired outfit, DiCTI, the proposed model for text-guided garment design, produces a photo-realistic image with the person in the original image in an outfit that matches the provided text description.
  • Figure 2: High-level overview of the proposed DiCTI method. DiCTI consists of multiple components. The Mask Generation Module (A) generates a binary mask covering the body/clothing and another one covering the head of the person in the input image. The body mask, along with the input image, is then passed to the Garment Synthesis Module (B), responsible for completing the masked-out parts of the image in adherence to the prompt. The synthesized image then undergoes post-processing to restore facial features that may have been altered during synthesis and ensure Identity Preservation.
  • Figure 3: Comparison of FICE and DiCTI. Example synthesis results are presented for various text prompts.
  • Figure 4: Examples from the user study. While DiCTI occasionally alters the pose slightly, the results are commonly of higher quality and more faithful to the text prompt.
  • Figure 5: Examples of different fabrics used for dresses, trousers, sweaters and shirts. The prompts were generated by pairing garment property and type.
  • ...and 6 more figures
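
The Figure 2 caption mentions a post-processing step that restores facial features using the head mask. This excerpt does not say how that step is implemented, so the following is only a minimal sketch under the assumption that it can be approximated by plain mask compositing: pixels inside the head mask are copied back from the input image, and all other pixels are taken from the synthesized image. The function name `restore_head` and its parameters are hypothetical and do not come from the paper.

```python
import numpy as np


def restore_head(original: np.ndarray,
                 synthesized: np.ndarray,
                 head_mask: np.ndarray) -> np.ndarray:
    """Paste the head region of `original` onto `synthesized`.

    Assumption (not from the paper): identity preservation is modeled
    as simple mask compositing.

    original, synthesized: uint8 images of shape (H, W, 3).
    head_mask: array of shape (H, W); nonzero marks head pixels.
    """
    if original.shape != synthesized.shape:
        raise ValueError("original and synthesized must have the same shape")
    if head_mask.shape != original.shape[:2]:
        raise ValueError("head_mask must match the image height and width")

    # Binarize the mask and add a channel axis so it broadcasts over RGB.
    alpha = (head_mask != 0).astype(np.float32)[..., None]

    # Head pixels come from the original, the rest from the synthesis.
    blended = (alpha * original.astype(np.float32)
               + (1.0 - alpha) * synthesized.astype(np.float32))
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

A soft (feathered) mask with values in [0, 1] could replace the binarized `alpha` to avoid visible seams at the mask boundary. That variant is likewise an assumption rather than something stated in this summary.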