Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting
Rong Zhang, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang
TL;DR
This work addresses fine-grained controllable apparel showcase image generation that preserves garment details. It introduces garment-centric outpainting (GCO), built on latent diffusion models, which comprises a garment-adaptive pose predictor and a garment-centric outpainting module with lightweight feature fusion and a multi-scale appearance customization module (MS-ACM). MS-ACM enables both overall appearance control via Showcase BLIP and fine-grained facial control via Face BLIP, with prompts integrated through cross-attention in the diffusion model. Experiments on the VITON-HD dataset show that GCO outperforms state-of-the-art methods in realism and controllability, supported by extensive ablation studies and a user study. The approach offers a practical, data-efficient path for generating customizable apparel showcase imagery for e-commerce without the need for paired data or garment warping.
Abstract
In this paper, we propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM) for fine-grained controllable apparel showcase image generation. The proposed framework aims to customize a fashion model wearing a given garment via text prompts and facial images. Unlike existing methods, our framework takes as input a garment image segmented from a dressed mannequin or a person, which eliminates the need to learn cloth deformation and ensures faithful preservation of garment details. The proposed framework consists of two stages. In the first stage, we introduce a garment-adaptive pose prediction model that generates diverse poses given the garment. In the second stage, we generate apparel showcase images conditioned on the garment and the predicted poses, along with specified text prompts and facial images. Notably, a multi-scale appearance customization module (MS-ACM) is designed to allow both overall and fine-grained text-based control over the generated model's appearance. Moreover, to integrate the multiple conditions efficiently, we use a lightweight feature fusion operation that introduces no extra encoders or modules. Extensive experiments validate the superior performance of our framework compared to state-of-the-art methods.
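The TL;DR notes that prompts enter the diffusion model through cross-attention. The following is a minimal sketch of that general mechanism, not the authors' implementation. It assumes the overall-appearance and face prompt tokens are simply concatenated into one context sequence, omits the learned query/key/value projections, and uses toy 4-dimensional vectors. The names `overall_tokens`, `face_tokens`, and `latent_tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Simplification (assumption): the context vectors serve as both keys
    and values, and no learned projections are applied.
    queries: list of d-dim vectors (e.g. image latent tokens)
    context: list of d-dim vectors (e.g. prompt tokens)
    Returns (outputs, weights), one row per query.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        w = softmax(scores)
        out = [sum(w[j] * context[j][i] for j in range(len(context)))
               for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy embeddings standing in for encoded prompts.
overall_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
face_tokens = [[0.0, 0.0, 1.0, 0.0]]

# Assumption: both prompt scales share one context sequence.
context = overall_tokens + face_tokens

# Hypothetical latent tokens acting as queries.
latent_tokens = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

outputs, weights = cross_attention(latent_tokens, context)
```

In this toy setup, each latent token receives a weighted mix of the prompt tokens, so a latent aligned with the face token draws mostly on the face prompt while the others draw on the overall prompt. How MS-ACM actually routes the two prompt scales is not specified in this section.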
