FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization

Rong Zhang, Jinxiao Li, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang

TL;DR

FashionMAC introduces deformation-free garment-centric fashion image generation by outpainting dressed garment images, avoiding garment warping distortions. It combines a two-stage diffusion-based pipeline with a Multi-Scale Description Extractor, Region-Adaptive Decoupled Attention (RADA), and a Chained Mask Injection (CMI) strategy to achieve high-fidelity visuals and fine-grained appearance control guided by faces or text. The approach yields both qualitative and quantitative gains over baselines on garment detail preservation and attribute controllability, with strong applicability to e-commerce workflows. Limitations due to dataset bias are acknowledged, with future work aiming to broaden demographic diversity and attribute coverage to enhance generalization.

Abstract

Garment-centric fashion image generation aims to synthesize realistic and controllable human models wearing a given garment, a task that has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model's appearance. Existing methods typically deform the garment during generation, which often distorts the garment texture. They also fail to control fine-grained attributes of the generated models because they lack mechanisms designed for that purpose. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the region to be generated for each fine-grained text attribute and, through a chained mask injection strategy, constrains each text attribute to attend to its predicted region, significantly enhancing visual fidelity and controllability. Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.
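The idea of confining each text attribute to its own region can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function name `region_masked_attention`, the tensor shapes, and the hand-written masks are all assumptions made for illustration. In particular, the abstract says the regions are predicted adaptively and injected through a chained strategy, whereas here the masks are simply given as inputs and no chaining is modeled. The sketch only shows per-attribute (decoupled) cross-attention whose output is gated by a spatial mask.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def region_masked_attention(queries, attr_keys, attr_values, region_masks):
    """Hypothetical sketch: one cross-attention per attribute, gated by a region mask.

    queries:      (N, d)    one query per spatial position
    attr_keys:    (A, T, d) keys for the T text tokens of each of A attributes
    attr_values:  (A, T, d) values for the same tokens
    region_masks: (A, N)    1 where an attribute may act, 0 elsewhere
                            (assumed given here; the paper predicts them)
    """
    n, d = queries.shape
    out = np.zeros((n, attr_values.shape[-1]))
    for keys, values, mask in zip(attr_keys, attr_values, region_masks):
        # Standard scaled dot-product attention over this attribute's tokens.
        weights = softmax(queries @ keys.T / np.sqrt(d))  # (N, T)
        # Gate the result so the attribute only contributes inside its region.
        out += mask[:, None] * (weights @ values)
    return out


# Toy example: 6 spatial positions, 2 attributes, 3 tokens each, dimension 4.
rng = np.random.default_rng(0)
n_pos, n_attr, n_tok, dim = 6, 2, 3, 4
q = rng.normal(size=(n_pos, dim))
k = rng.normal(size=(n_attr, n_tok, dim))
v = rng.normal(size=(n_attr, n_tok, dim))

masks = np.zeros((n_attr, n_pos))
masks[0, :3] = 1.0   # attribute 0 acts on positions 0-2
masks[1, 3:5] = 1.0  # attribute 1 acts on positions 3-4; position 5 is uncovered

out = region_masked_attention(q, k, v, masks)
```

In this toy setup, positions 0 to 2 receive only the first attribute's contribution, positions 3 and 4 only the second's, and position 5 receives none, which is the "each attribute focuses on its region" behavior in its simplest form.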

Paper Structure

This paper contains 31 sections, 4 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Given a garment image segmented from a dressed mannequin or a person, our method can generate fashion showcase images via garment-centric outpainting under the guidance of face images or fine-grained text attributes. From top to bottom: (1) The results conditioned on the automatically generated diverse pose maps. (2) The results conditioned on different face images. (3) The results conditioned on fine-grained text prompts.
  • Figure 2: (a) Framework comparison of fashion image generation methods. (b) Visualization of the cross-attention maps. Note that our method preserves details better and enables accurate fine-grained text customization.
  • Figure 3: The overview of our framework. It consists of a multi-scale description extractor and a region-adaptive decoupled attention mechanism. After training, our method can take facial images, overall showcase descriptions, or fine-grained text attributes as optional inputs to generate diverse showcase images.
  • Figure 4: The chained mask injection strategy.
  • Figure 5: Comparison with the baseline methods. The first two rows show the results with facial image guidance. The last two rows show results with text prompt guidance.
  • ...and 8 more figures