ACDG-VTON: Accurate and Contained Diffusion Generation for Virtual Try-On

Jeffrey Zhang, Kedan Li, Shao-Yu Chang, David Forsyth

TL;DR

This work proposes a training scheme that limits the scope in which diffusion is trained, and uses a control image that perfectly aligns with the target image during training to accurately preserve garment details during inference.

Abstract

Virtual Try-on (VTON) involves generating images of a person wearing selected garments. Diffusion-based methods, in particular, can create high-quality images, but they struggle to maintain the identities of the input garments. We identify that this problem stems from specifics of the training formulation for diffusion. To address this, we propose a training scheme that limits the scope in which diffusion is trained. We use a control image that perfectly aligns with the target image during training. In turn, this accurately preserves garment details during inference. We demonstrate that our method not only conserves garment details effectively but also allows for layering, styling, and shoe try-on. Our method runs multi-garment try-on in a single inference cycle and can support high-quality zoomed-in generations without training at higher resolutions. Finally, we show that our method surpasses prior methods in accuracy and quality.

Paper Structure

This paper contains 32 sections, 6 equations, 18 figures, 1 table.

Figures (18)

  • Figure 1: Accuracy refers to how well the generated items reproduce the details of the actual garments. Our method preserves garment details such as graphics, text, and patterns better than diffusion-based try-on methods. LaDI-VTON [Morelli2023LadiVTON] and StableVITON [Kim2023StableVITON] are not accurate because both methods alter and hallucinate details on the garment. Furthermore, our proposed HR Zoom method can further preserve details that VAEs may distort (e.g., text and small patterns).
  • Figure 2: VAEs [Kingma2014] can cause loss of high-frequency details. If we take a 512x512 crop from a larger image and encode and decode it through a VAE, certain details will be altered because the VAE dictionary cannot reconstruct high-frequency features. The button's colors are changed, the angel's face and body are altered, and the pattern near the hem differs (see red arrows). We can work around this issue by upsampling the 512x512 crop to 1024x1024. The details are better preserved if we encode and decode the 1024x1024 image through the VAE (see green arrows).
  • Figure 3: Inference procedure. We want model image $m$ to wear garment $g$. We apply the Semantic Layout Gen $H$ and Warper $W$ to get the semantic layout parsing $p$ and warped garment $g^w$ (we use $W$ and $H$ from [Li2024Controlling], but expect that others (e.g., [tprvton, Issenhuth2020DoNM, Kedan_Li_2021_CVPR, Chopra_2021_ICCV]) would work as well). We overlay the warped garment $g^w$ on model image $m$, and we use $p$ to fill in the skin with the median pixel value from the face to get our incomplete image $m^i$ during inference (the control image is different for training; see Figure 5). For full-body inference, we pass $m^i$ as the initialization, control, and image CLIP embedding to our trained diffusion model $F$ to get the try-on output $\hat{m}^g$ (architecture in Figure 4). Notice that our trained denoiser $F$ preserves garment details while fixing bad strap segmentations on the dress and the bag. For HR Zoom inference, we crop and upsample $m^i$ to create $m^i_{zoom}$. We run inference through the same trained denoiser $F$ to generate a close-up $\hat{m}^g_{zoom}$ that has the same resolution as $\hat{m}^g$.
  • Figure 4: We adopt the ControlNet [Zhang2023ControlNet] architecture to take in a simulated incomplete model image $s^i$ as a control to help preserve the identities of the garments. In addition, the control $s^i$ is used as the noisy image for the first few timesteps during training, and the ground truth model $m^g$ is used for the noisy image otherwise (see "Control Initialization" in the training method section). An SMPL-X joint image $j$ is concatenated to the noisy image as input to the diffusion network to better control the generation of arms and fingers. The control image $s^i$ is fed into a CLIP image encoder [ramesh2022] and used as the embedding condition for the diffusion network. Finally, an MSE loss is applied between the predicted noise and the added noise (we show images $\hat{m}^g$ and $m^g$ in the figure for simplicity).
  • Figure 5: We show how the simulated incomplete image $s^i$ is created for training denoiser $F$. The first step is to train a U-Net [UNet] to take a garment in a model image and convert it to a warp distribution. The next step is to run the U-Net on each garment in a person image to create a "reverse warp" for each garment. This alters each garment so that it looks like a warped garment, but crucially, all the garment features remain aligned. Finally, the incomplete image is created by filling in the skin with a constant skin value derived from the median pixel value of the face.
  • ...and 13 more figures
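
The captions for Figures 3 and 5 describe building an incomplete image by filling skin regions with the median face pixel value and overlaying the warped garment, and HR Zoom by cropping and upsampling that image. The following is a minimal NumPy sketch of those two steps, not the authors' implementation. The function names, the boolean masks (which in the paper would come from the semantic layout $p$), the choice to paste the garment after the skin fill, and the nearest-neighbour upsampling are all assumptions made for illustration.

```python
import numpy as np

def build_incomplete_image(model_img, warped_garment, garment_mask, skin_mask, face_mask):
    """Hypothetical helper: flat skin fill, then the warped garment on top.

    model_img, warped_garment: (H, W, 3) arrays.
    garment_mask, skin_mask, face_mask: (H, W) boolean arrays (assumed given;
    face_mask must select at least one pixel).
    """
    out = model_img.copy()
    # Per-channel median colour of the face pixels, used as a constant skin value.
    skin_value = np.median(model_img[face_mask], axis=0).astype(model_img.dtype)
    out[skin_mask] = skin_value
    # Assumption: the garment is pasted last so it covers any filled skin beneath it.
    out[garment_mask] = warped_garment[garment_mask]
    return out

def zoom_crop(img, top, left, size, factor=2):
    """Hypothetical helper: crop a square region and upsample it.

    Nearest-neighbour repetition is used here only for simplicity; the
    interpolation method is not specified in the captions above.
    """
    crop = img[top:top + size, left:left + size]
    return crop.repeat(factor, axis=0).repeat(factor, axis=1)
```

With a 512x512 crop and `factor=2`, `zoom_crop` returns a 1024x1024 array, matching the upsampling example in the Figure 2 caption. The result would then be passed to the same denoiser as the full-body control image.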