Improving Virtual Try-On with Garment-focused Diffusion Models

Siqi Wan; Yehao Li; Jingwen Chen; Yingwei Pan; Ting Yao; Yang Cao; Tao Mei

TL;DR

This work introduces GarDiff, a diffusion model that drives a garment-focused diffusion process with amplified guidance from both the basic visual appearance and the detailed textures of the given garment, and designs an appearance loss over the synthesized garment to enhance crucial high-frequency details.

Abstract

Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty is that the diffusion process should not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we propose a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.
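The appearance loss is only named in the abstract; the caption of Figure 2 further describes it as a spatial perceptual term plus a high-frequency promoted term, both computed over the generated garment. The sketch below illustrates that two-term structure only. Everything concrete in it is an assumption made for illustration: a masked L1 distance stands in for the perceptual metric, a 3x3 Laplacian filter stands in for the high-frequency extractor, and the names `appearance_loss`, `high_pass`, `masked_l1`, and `high_freq_weight` are hypothetical rather than taken from the GarDiff code.

```python
import numpy as np

# Assumption: a 3x3 Laplacian kernel as a simple high-pass filter.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])


def high_pass(img):
    """Apply the Laplacian kernel to a 2D grayscale image (edge padding)."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def masked_l1(a, b, mask):
    """Mean absolute difference restricted to pixels where mask == 1."""
    return float((np.abs(a - b) * mask).sum() / max(mask.sum(), 1.0))


def appearance_loss(generated, reference, garment_mask, high_freq_weight=1.0):
    """Two-term loss over the garment region (illustrative stand-ins only)."""
    # Stand-in for the spatial perceptual term.
    l_spatial = masked_l1(generated, reference, garment_mask)
    # Stand-in for the high-frequency promoted term.
    l_high_freq = masked_l1(high_pass(generated), high_pass(reference), garment_mask)
    return l_spatial + high_freq_weight * l_high_freq
```

Restricting both terms to the garment mask reflects the statement that the loss is applied over the synthesized garment rather than the whole person image; the relative weighting of the two terms is not stated here, so `high_freq_weight` is left as a free parameter.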

Paper Structure (16 sections, 10 equations, 7 figures, 3 tables)

Introduction
Related Work
Method
Overview
Garment-Focused Adapter
Appearance Loss
Experiments
Experimental Settings
Quantitative Results
Qualitative Results
Analysis and Discussions
Ablation Study on GarDiff
Effect of Unwarped Garment
Preservation of Fine-grained Details
Conclusion
...and 1 more sections

Figures (7)

Figure 1: Existing GAN-based VTON methods (e.g., VITON-HD, HR-VTON and GP-VTON) and diffusion-based VTON techniques (e.g., LaDI-VTON and DCI-VTON) often fail to perfectly retain every appearance/texture detail of the given garment (e.g., complex patterns or text). Instead, our GarDiff exploits a garment-focused diffusion process to preserve most of the fine-grained details of the given garment, pursuing more controllable person image generation.
Figure 2: An overview of our GarDiff. The cross-attention layer is substituted with the garment-focused vision adapter in each Transformer block. First, we extract the CLIP visual embeddings $\mathbf{f}_{clip}$ and VAE embeddings $\mathbf{f}_{vae}$ of the target garment $\mathbf{I}_c$ and warped garment $\mathbf{I}_w$, respectively. Then the two embeddings are fed into the garment-focused adapter as keys and values via a decoupled cross-attention to guide the diffusion process, pursuing local fine-grained alignment with the appearance of the target garment. Meanwhile, we employ a novel appearance loss $\mathcal{L}_{appearance}$, comprising a spatial perceptual loss $\mathcal{L}_{spatial}$ and a high-frequency promoted loss $\mathcal{L}_{high\text{-}freq}$ over the generated garment, to enhance the proficiency of GarDiff in generating high-frequency details.
Figure 3: Implementation details of our garment-focused adapter. For the given target garment $\mathbf{I}_c$ and warped garment $\mathbf{I}_w$, the CLIP visual embeddings $\mathbf{f}_{clip}$ and VAE embeddings $\mathbf{f}_{vae}$ are extracted and fed into the garment-focused adapter as the keys and values through a decoupled cross-attention. $\mathbf{M}_{attn}$ is used to suppress the weights unrelated to the garment area in the attention map, yielding garment-focused features.
Figure 4: Examples generated by VITON-HD, HR-VTON, GP-VTON, LaDI-VTON, DCI-VTON and our GarDiff.
Figure 5: User study on 100 garment-person pairs randomly sampled from VITON-HD.
...and 2 more figures
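
The captions of Figures 2 and 3 describe a decoupled cross-attention in which the CLIP and VAE garment embeddings act as separate sets of keys and values, with $\mathbf{M}_{attn}$ suppressing attention weights unrelated to the garment area. A minimal NumPy sketch of that idea follows. It rests on several assumptions not stated in the captions: the two branches are summed, the mask is binary with shape (queries, keys) and is applied to the VAE branch only, and the projection names in the `w` dictionary are hypothetical.

```python
import numpy as np


def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is 1 where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Suppressed entries get -inf so their weight becomes exactly zero.
        scores = np.where(mask > 0, scores, -np.inf)
    row_max = scores.max(axis=-1, keepdims=True)
    # Rows with no allowed key have row_max == -inf; use 0 to avoid NaNs.
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=-1, keepdims=True)
    weights = weights / np.maximum(denom, 1e-12)
    return weights @ v


def garment_focused_attention(z, f_clip, f_vae, m_attn, w):
    """Decoupled cross-attention over CLIP and VAE garment embeddings.

    z:      (num_queries, d) latent features from the UNet block
    f_clip: (n_clip, d) CLIP embeddings of the target garment
    f_vae:  (n_vae, d) VAE embeddings of the warped garment
    m_attn: (num_queries, n_vae) binary mask (assumed shape and placement)
    w:      dict of (d, d) projection matrices (hypothetical names)
    """
    q = z @ w["q"]
    out_clip = attention(q, f_clip @ w["k_clip"], f_clip @ w["v_clip"])
    out_vae = attention(q, f_vae @ w["k_vae"], f_vae @ w["v_vae"], mask=m_attn)
    # Assumption: the two branches are combined by simple addition.
    return out_clip + out_vae
```

Keeping separate key and value projections per embedding type is what makes the attention "decoupled": each source is attended to independently before the results are merged. In this sketch, a query row whose mask is entirely zero receives no contribution from the VAE branch.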
