AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Yunfang Niu, Lingxiang Wu, Dong Yi, Jie Peng, Ning Jiang, Haiying Wu, Jinqiao Wang

TL;DR

This paper extends an existing dataset for human generation to include a wider range of apparel and more complex backgrounds and proposes AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas and outperforms contemporary text-guided fashion editing methods.

Abstract

Fashion image editing aims to modify a person's appearance based on a given instruction. Existing methods require auxiliary tools such as segmenters and keypoint extractors, and so lack a flexible and unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people against clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to include a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users can simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types and CLIP-encoded apparel features. Both qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.

Paper Structure (35 sections, 5 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Fashion Editing with AnyDesign. Our model adapts to various settings and edits a wide range of apparel categories using flexible prompts.
  • Figure 2: (a) The Dataset Extension Method. We extract keypoints and densepose information using existing methods. Then, apparel-specific extractors are designed to create agnostic images and guidance prompts. (b) Different feature removal strategies.
  • Figure 3: The Overall Architecture of the Fashion Editing Framework. (a) Two-stage Image Training Framework. In Stage I, we train a mask-based model to generate pseudo-samples using unpaired text prompts or image prompts. In Stage II, we train the final mask-free model using the generated pseudo-samples, together with the paired prompts and the apparel types, as inputs. The training goal at this stage is to generate realistic images. (b) The architecture of Fashion DiT.
  • Figure 4: Fashion-Guidance Attention (FGA) Module.
  • Figure 5: Visual Comparison on VITON-HD and Dresscode images. From left to right: the given person, followed by the text-driven editing results of a series of methods.
  • ...and 11 more figures
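
The two-stage data flow described in the Figure 3 caption can be sketched roughly as follows. This is only one reading of the caption, not the authors' implementation: the names `RealSample`, `StageTwoSample`, `build_stage_two_samples`, and `toy_editor` are hypothetical, images are represented by plain strings, and the Stage I model is replaced by a stand-in function. The sketch assumes that the Stage II target is the original real image, which is how "the training goal at this stage is to generate realistic images" is interpreted here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RealSample:
    image: str          # stand-in for image data (hypothetical: a file name)
    apparel_type: str   # e.g. "scarf"
    paired_prompt: str  # prompt describing the apparel actually worn in `image`


@dataclass(frozen=True)
class StageTwoSample:
    source: str         # pseudo-sample produced by the Stage I mask-based model
    apparel_type: str   # explicit apparel type given to the mask-free model
    prompt: str         # paired prompt of the real image
    target: str         # real image (assumed training target for Stage II)


# Assumed Stage I interface: (image, apparel_type, unpaired_prompt) -> edited image
MaskBasedEditor = Callable[[str, str, str], str]


def build_stage_two_samples(
    reals: List[RealSample],
    unpaired_prompts: List[str],
    stage_one_edit: MaskBasedEditor,
) -> List[StageTwoSample]:
    """Pair each real image with a pseudo-sample edited by the Stage I model."""
    samples = []
    for real, unpaired in zip(reals, unpaired_prompts):
        # Stage I: edit the real image with a prompt that does NOT match it.
        pseudo = stage_one_edit(real.image, real.apparel_type, unpaired)
        # Stage II: learn to map the pseudo-sample back to the real image,
        # guided by the paired prompt and the apparel type, with no mask.
        samples.append(
            StageTwoSample(pseudo, real.apparel_type, real.paired_prompt, real.image)
        )
    return samples


def toy_editor(image: str, apparel_type: str, prompt: str) -> str:
    """Stand-in for the Stage I model: it only records what it was asked to do."""
    return f"{image}|{apparel_type}<-{prompt}"


if __name__ == "__main__":
    reals = [RealSample("person_01.jpg", "scarf", "red wool scarf")]
    for sample in build_stage_two_samples(reals, ["blue silk scarf"], toy_editor):
        print(sample)
```

Under this reading, the mask is needed only while building the pseudo-samples; the Stage II model never sees it, which is what makes the final editor mask-free.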