Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting

Rong Zhang, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang

TL;DR

This work addresses fine-grained controllable generation of apparel showcase images that preserve garment details. It introduces garment-centric outpainting (GCO), built on latent diffusion models, which comprises a garment-adaptive pose predictor and a garment-centric outpainting module with a lightweight feature fusion operation and a multi-scale appearance customization module (MS-ACM). MS-ACM enables overall appearance control via Showcase BLIP and fine-grained facial control via Face BLIP, with prompts integrated through cross-attention in the diffusion model. Experiments on the VITON-HD dataset show that GCO outperforms state-of-the-art methods in realism and controllability, a result supported by ablation studies and a user study. The approach offers a practical, data-efficient way to generate customizable apparel showcase imagery for e-commerce without paired data or garment warping.

Abstract

In this paper, we propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM) for fine-grained controllable apparel showcase image generation. The proposed framework aims at customizing a fashion model wearing a given garment via text prompts and facial images. Different from existing methods, our framework takes a garment image segmented from a dressed mannequin or a person as the input, eliminating the need for learning cloth deformation and ensuring faithful preservation of garment details. The proposed framework consists of two stages. In the first stage, we introduce a garment-adaptive pose prediction model that generates diverse poses given the garment. Then, in the next stage, we generate apparel showcase images, conditioned on the garment and the predicted poses, along with specified text prompts and facial images. Notably, a multi-scale appearance customization module (MS-ACM) is designed to allow both overall and fine-grained text-based control over the generated model's appearance. Moreover, we leverage a lightweight feature fusion operation without introducing any extra encoders or modules to integrate multiple conditions, which is more efficient. Extensive experiments validate the superior performance of our framework compared to state-of-the-art methods.

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given a garment image segmented from a dressed mannequin or a person, our method can generate apparel showcase images via garment-centric outpainting under the guidance of face images or fine-grained text attributes. From top to bottom: (1) The results conditioned on the automatically generated diverse pose maps. (2) The results conditioned on different face images. (3) The results conditioned on fine-grained text prompts.
  • Figure 2: Task comparison of the existing fashion-related image generation methods. (a): virtual try-on based method; (b): the showcase generation method, where b.1 is the existing method and b.2 is our outpainting-based method. Note that our method can preserve the garment details better and enable fine-grained text prompts for customization.
  • Figure 3: The overview of our GCO framework. (a): The pipeline of the garment-adaptive pose predictor. In the inference stage, we can utilize it to sample diverse pose maps that fit a target garment. (b): The training process of the outpainting-based showcase generation stage. It consists of a multi-scale appearance customization module and a lightweight feature fusion operation. For the multi-scale appearance customization module, our method extracts the overall image descriptions as coarse conditions and the fine-grained face attributes as detailed conditions through two different BLIP models. For the lightweight feature fusion, we fuse the multiple input conditions including the garment, the pose, the mask, and the facial features with the image to be generated through spatial or channel-wise concatenation. After training, our method can take facial images, overall showcase descriptions, or fine-grained text attributes as optional inputs to generate diverse showcase images.
  • Figure 4: Comparison with the baseline methods. The first two rows show paired results, where the input garment and face come from the same original showcase image (the first column). The last two rows show unpaired results, where the input garment and face come from different images.
  • Figure 5: Comparison of multi-scale appearance customization module. The overall text prompt is marked in green while the fine-grained attributes are marked in blue. The attributes that may be mismatched are highlighted in bold.
  • ...and 1 more figure
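
The Figure 3 caption describes the lightweight feature fusion as spatial or channel-wise concatenation of the conditions with the image to be generated. The sketch below only does the shape bookkeeping for those two kinds of concatenation. It is a minimal illustration and not the authors' implementation: the helper `concat_shape`, the latent sizes, and the choice of which condition is joined along which axis are all assumptions, since the source does not specify them.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def concat_shape(shapes: Sequence[Shape], axis: int) -> Shape:
    """Return the shape produced by concatenating tensors along `axis`.

    axis 0 is channel-wise concatenation; axis 1 or 2 is spatial
    (height or width). Every other dimension must agree.
    """
    if not shapes:
        raise ValueError("need at least one shape")
    if axis not in (0, 1, 2):
        raise ValueError("axis must be 0, 1, or 2")
    first = shapes[0]
    total = 0
    for shape in shapes:
        for dim in range(3):
            if dim != axis and shape[dim] != first[dim]:
                raise ValueError(
                    f"shape {shape} does not match {first} off axis {axis}"
                )
        total += shape[axis]
    fused = list(first)
    fused[axis] = total
    return (fused[0], fused[1], fused[2])


# Hypothetical latent sizes, chosen only for illustration.
noisy_latent = (4, 64, 48)    # image to be generated
garment_latent = (4, 64, 48)  # garment condition
pose_latent = (4, 64, 48)     # pose-map condition
mask = (1, 64, 48)            # outpainting mask
face_latent = (4, 64, 16)     # facial features

# Assumption: conditions aligned with the image are stacked on channels.
channel_fused = concat_shape(
    [noisy_latent, garment_latent, pose_latent, mask], axis=0
)
print(channel_fused)  # (13, 64, 48)

# Assumption: facial features are placed beside the image along the width.
spatial_fused = concat_shape([noisy_latent, face_latent], axis=2)
print(spatial_fused)  # (4, 64, 64)
```

Both operations only widen the input along one axis, which is consistent with the abstract's point that the fusion needs no extra encoders or modules. In this sketch a channel-wise fusion would change only the input channel count of the first layer of the denoising network, while a spatial fusion would change only the input resolution.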