Controllable Human Image Generation with Personalized Multi-Garments

Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin

TL;DR

BootComp tackles data bottlenecks in controllable human image generation with multiple garments by introducing a two-stage approach: a decomposition-based synthetic data generation pipeline and a composition diffusion module that fuses multiple garment conditions. It employs extended self-attention to inject garment features and trains an encoder while keeping the generator frozen, enabling flexible downstream tasks such as virtual try-on, pose-guided, and stylized generation without task-specific fine-tuning. The method achieves state-of-the-art garment fidelity and compositionality, demonstrated through quantitative improvements and diverse applications. By reducing data collection costs and enabling multi-garment controllability in diffusion models, BootComp has practical implications for personalized fashion generation and related AI-assisted design workflows.

Abstract

We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, since ideally one would need to manually gather a photograph of every single garment worn by each human. To address this, we propose a data generation pipeline that constructs a large synthetic dataset of human and multiple-garment pairs by introducing a model that extracts reference garment images from each human image. To ensure data quality, we also propose a filtering strategy that removes undesirable generated data by measuring the perceptual similarity between the garment as it appears in the human image and the extracted garment. Finally, using the constructed synthetic dataset, we train a diffusion model with two parallel denoising paths that uses multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on and controllable human image generation with other conditions, e.g., pose and face.
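
The filtering step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not specify the perceptual metric, the feature extractor, or the threshold, so `cosine_similarity` stands in for whatever perceptual similarity the authors use, the feature vectors are hypothetical, and `threshold=0.8` is an arbitrary value.

```python
import math


def cosine_similarity(u, v):
    """Stand-in similarity (assumption): cosine between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, threshold=0.8, similarity=cosine_similarity):
    """Keep only (worn, extracted) feature pairs that look alike.

    `pairs` holds hypothetical feature vectors: one for the garment as it
    appears on the human, one for the extracted product-style garment.
    Pairs scoring below `threshold` are treated as failed extractions.
    """
    kept = []
    for worn_features, extracted_features in pairs:
        if similarity(worn_features, extracted_features) >= threshold:
            kept.append((worn_features, extracted_features))
    return kept


# Usage: the first pair matches closely, the second does not.
samples = [
    ([1.0, 0.0, 0.2], [0.9, 0.1, 0.2]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
clean = filter_pairs(samples)
```

In a real pipeline, the similarity function would be replaced by a learned perceptual metric, and the threshold would be chosen by inspecting the score distribution of the generated data.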

Paper Structure

This paper contains 25 sections, 7 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Generated images by BootComp. (a) BootComp generates high-quality human images wearing multiple reference garments, with support for extended categories such as bags and shoes, even in unusual garment combinations (e.g., a swimming suit with soccer cleats). We show BootComp's generalization capability through various conditional image generation tasks, such as (b) virtual try-on, (c) pose-guided generation, (d) stylization, and (e) text-guided generation, even though BootComp is not directly trained or fine-tuned for each task.
  • Figure 2: Limitations of previous data curation approaches used in controllable generation. Previous approaches to controllable generation often train on a paired dataset consisting of low-quality segmented garments and human images. This leads to several undesirable artifacts, as shown on the right (generated with baselines). For example, garments are directly replicated from the reference images in (a), shirts and skirts are blended together in (b), and generated skirts fail to resemble the reference in (c).
  • Figure 3: Overview of BootComp. We propose a two-stage framework: synthetic data generation and composition module training for controllable human image generation. (a) We train a decomposition network that maps from a segmented garment image to a product garment image. (b) We bootstrap synthetic paired data of human and multiple garment images. (c) We finally train our composition module with the synthetic paired dataset enabling it to generate human images with multiple reference garment images.
  • Figure 4: Extended self-attention architecture. In an extended self-attention layer, reference hidden states are concatenated with the target hidden states in the key and value matrices. This architecture enables injecting reference image features into the target image. Note that the decomposition module also uses the same structure but works within a single network.
  • Figure 5: Examples of high- and low-quality generated garments. When human parsing results are not precise, the decomposition network struggles to generate product garment images accurately, resulting in low-quality garment images. We filter out these cases.
  • ...and 14 more figures
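
The extended self-attention described in the Figure 4 caption can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' implementation: it uses a single head, plain Python lists instead of tensors, and hypothetical projection matrices `w_q`, `w_k`, and `w_v`. The only part taken from the caption is that reference hidden states are concatenated with target hidden states for the keys and values, while queries come from the target alone.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def extended_self_attention(target, reference, w_q, w_k, w_v):
    """Single-head attention where keys/values also cover reference tokens.

    target:    n_t x d hidden states of the image being generated
    reference: n_r x d hidden states of the garment image(s)
    Returns the n_t x d output and the n_t x (n_t + n_r) attention weights.
    """
    queries = matmul(target, w_q)          # queries: target tokens only
    kv_input = target + reference          # concatenate along the token axis
    keys = matmul(kv_input, w_k)
    values = matmul(kv_input, w_v)

    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values), weights


# Usage: 2 target tokens attend over 2 target + 3 reference tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
target_states = [[1.0, 0.0], [0.0, 1.0]]
reference_states = [[1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
output, attn = extended_self_attention(
    target_states, reference_states, identity, identity, identity
)
print(len(output), len(attn[0]))  # prints "2 5"
```

Because the output keeps the target's token count, such a layer can replace a standard self-attention layer without changing downstream shapes; passing an empty reference list reduces it to ordinary self-attention.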