Magic Clothing: Controllable Garment-Driven Image Synthesis

Weifeng Chen, Tao Gu, Yuhao Xu, Chengcai Chen

TL;DR

Magic Clothing tackles garment-driven image synthesis by preserving garment details while following text prompts. It introduces a garment extractor that feeds garment features into a frozen latent diffusion model via self-attention fusion, and uses joint classifier-free guidance to balance garment fidelity with prompt fidelity using scales $S_G$ and $S_T$. The garment extractor is a plug-in module compatible with ControlNet, IP-Adapter and LoRA-based LDMs, enabling diverse conditioning such as pose, face, and style. A robust MP-LPIPS metric assesses garment–image consistency; experiments on the VITON-HD dataset demonstrate state-of-the-art performance and strong controllability.
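
The balance between the two scales can be illustrated with a small sketch. It assumes a two-condition classifier-free guidance decomposition in the style commonly used for image-plus-text conditioning, where the garment direction is applied first and the text direction on top of it. The paper's exact equation is not reproduced in this summary, so the ordering of the conditions and the function name `joint_cfg` are assumptions made for illustration.

```python
import numpy as np

def joint_cfg(eps_uncond, eps_garment, eps_both, s_g, s_t):
    """Combine three noise predictions using two guidance scales.

    eps_uncond : prediction with neither garment nor text condition
    eps_garment: prediction with the garment condition only
    eps_both   : prediction with both garment and text conditions
    s_g, s_t   : garment and text guidance scales (S_G and S_T)

    Assumed decomposition (not confirmed by the source):
        eps = eps_uncond + s_g * (eps_garment - eps_uncond)
                         + s_t * (eps_both - eps_garment)
    """
    garment_direction = eps_garment - eps_uncond  # effect of adding the garment
    text_direction = eps_both - eps_garment       # effect of adding the text on top
    return eps_uncond + s_g * garment_direction + s_t * text_direction

# Toy usage with random stand-ins for the three model outputs.
rng = np.random.default_rng(0)
e_u, e_g, e_gt = (rng.normal(size=(4, 8, 8)) for _ in range(3))
guided = joint_cfg(e_u, e_g, e_gt, s_g=2.5, s_t=7.5)  # scale values are arbitrary
```

Under this assumed form, setting both scales to 1 recovers the fully conditioned prediction, and raising one scale pushes the result further along the corresponding direction, which matches the qualitative behaviour described for $S_G$ and $S_T$.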

Abstract

We propose Magic Clothing, a latent diffusion model (LDM)-based network architecture for the previously unexplored task of garment-driven image synthesis. The goal is to generate customized characters wearing target garments under diverse text prompts, so image controllability is the most critical issue: the result must preserve the garment details while remaining faithful to the text prompt. To this end, we introduce a garment extractor that captures detailed garment features, and we employ self-attention fusion to incorporate them into pretrained LDMs, so that the garment details remain unchanged on the target character. We then use joint classifier-free guidance to balance the influence of garment features and text prompts on the generated results. The proposed garment extractor is a plug-in module applicable to various finetuned LDMs, and it can be combined with extensions such as ControlNet and IP-Adapter to enhance the diversity and controllability of the generated characters. Furthermore, we design Matched-Points-LPIPS (MP-LPIPS), a robust metric for evaluating the consistency of the target image with the source garment. Extensive experiments demonstrate that Magic Clothing achieves state-of-the-art results under various conditional controls for garment-driven image synthesis. Our source code is available at https://github.com/ShineChen1024/MagicClothing.
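
As a rough illustration of self-attention fusion, the sketch below shows one plausible way to let denoising tokens attend to garment tokens: the garment features are appended to the keys and values of a self-attention layer, while the queries come only from the denoising features. This is a single-head NumPy toy, not the released implementation; the function names, shapes, and the choice to fuse on keys and values are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_self_attention(h, g, w_q, w_k, w_v):
    """Single-head self-attention with garment tokens fused in.

    h : (n, d) tokens from the denoising network
    g : (m, d) tokens from the garment extractor (hypothetical shape)
    w_q, w_k, w_v : (d, d_head) projection matrices
    Returns an (n, d_head) array, one output per denoising token.
    """
    q = h @ w_q                               # queries: denoising tokens only
    kv_input = np.concatenate([h, g], axis=0) # (n + m, d): own tokens plus garment tokens
    k = kv_input @ w_k
    v = kv_input @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, n + m)
    return softmax(scores, axis=-1) @ v

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
g = rng.normal(size=(3, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = fused_self_attention(h, g, w_q, w_k, w_v)
```

With no garment tokens the function reduces to ordinary self-attention, which is consistent with the extractor being described as a plug-in that leaves the pretrained LDM weights untouched.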

Paper Structure (24 sections, 8 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of our Magic Clothing. We propose a garment extractor that captures the garment features and incorporates them into the denoising process through the self-attention layers. Besides the paired garment and character images, we obtain the text prompts for training through BLIP [li2022blip]. Only the garment extractor requires additional training; it is a plug-in module compatible with other useful extensions such as ControlNet [zhang2023adding] or IP-Adapter [ye2023ip].
  • Figure 2: Example results with different text guidance scales $S_{T}$ and garment guidance scales $S_{G}$. With a larger $S_{T}$, the generated image becomes more faithful to the text prompt, while a larger $S_{G}$ preserves more garment details.
  • Figure 3: MP-LPIPS measures the consistency of the character (right column) with the garment (left column) by comparing patches centred on matched points. Given points in the source garment, we use diffusion features [tang2023emergent] to retrieve the corresponding points in the target character.
  • Figure 4: Qualitative comparison with traditional subject-driven image synthesis methods, including IP-Adapter [ye2023ip], BLIP-Diffusion [li2024blip], Versatile Diffusion [xu2023versatile], and ControlNet-Garment [zhang2023adding].
  • Figure 5: Examples of plug-in results of our Magic Clothing combined with finetuned anime-style LDMs (1st row), ControlNet-Openpose (2nd row), ControlNet-Inpaint (3rd row), IP-Adapter-FaceID (4th row), and multiple extensions (5th row).
  • ...and 1 more figure
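
The patch-comparison idea behind MP-LPIPS (Figure 3) can be sketched as follows. The sketch assumes the matched point pairs are already available (the caption says they are retrieved with diffusion features, which is not reproduced here), and it uses a plain mean-squared difference as a stand-in for the learned LPIPS distance. All function names and the patch size are hypothetical.

```python
import numpy as np

def crop_patch(img, cx, cy, size):
    """Crop a size x size patch centred on (cx, cy), shifted to stay inside the image."""
    height, width = img.shape[:2]
    half = size // 2
    x0 = min(max(cx - half, 0), width - size)
    y0 = min(max(cy - half, 0), height - size)
    return img[y0:y0 + size, x0:x0 + size]

def mse_distance(a, b):
    """Stand-in patch distance; the actual metric would use LPIPS here."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def matched_points_score(garment, character, pairs, size=32, patch_distance=mse_distance):
    """Average patch distance over matched points.

    pairs: list of ((gx, gy), (cx, cy)), a point in the garment image and
           its corresponding point in the character image.
    Lower values mean the patches around matched points look more alike.
    """
    distances = [
        patch_distance(crop_patch(garment, gx, gy, size),
                       crop_patch(character, cx, cy, size))
        for (gx, gy), (cx, cy) in pairs
    ]
    return float(np.mean(distances))

# Toy usage: the "character" image is the garment shifted 5 pixels to the right.
rng = np.random.default_rng(0)
garment_img = rng.random((64, 64, 3))
character_img = np.roll(garment_img, 5, axis=1)
good_pairs = [((20, 20), (25, 20))]  # accounts for the shift
bad_pairs = [((20, 20), (20, 20))]   # ignores the shift
score_good = matched_points_score(garment_img, character_img, good_pairs, size=16)
score_bad = matched_points_score(garment_img, character_img, bad_pairs, size=16)
```

Comparing patches around matched points rather than whole images is what lets such a score tolerate the garment appearing at a different position or pose on the character; swapping `mse_distance` for a perceptual distance would bring the sketch closer to the metric described in the caption.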