Training-free Clothing Region of Interest Self-correction for Virtual Try-On

Shengjie Lu, Zhibin Wan, Jiejie Liu, Quan Zhang, Mingjie Sun

TL;DR

The paper tackles the mismatch between generated and target clothing in diffusion-based virtual try-on by introducing Clothing Region of Interest Self-correction (CSC), a training-free module that steers denoising attention to the target clothing region via an energy-guided approach with attention-attract and attention-repel terms. It also introduces VTID, a unified metric for evaluating both paired and unpaired VTON outputs. Empirically, CSC yields state-of-the-art gains on VITON-HD and DressCode across LPIPS, FID, KID, and VTID, and improves downstream Clothing-Change Re-identification (CC-ReID) performance when used to augment training data. The work provides a practical, plug-and-play enhancement with public code for improved clothing detail preservation and alignment in virtual try-on systems.

Abstract

Virtual Try-On (VTON) aims to synthesize the target clothing on a given person, preserving the details of the target clothing while keeping the rest of the person unchanged. Existing methods suffer from discrepancies between the generated clothing and the target clothing in patterns, textures, and boundaries. We therefore propose an energy function that imposes constraints on the attention map extracted during the generation process. At each generation step, the attention is thereby focused more on the clothing region of interest, making the generated results more consistent with the target clothing details. Furthermore, existing evaluation metrics concentrate solely on image realism and overlook alignment with target elements; to bridge this gap, we design a new metric, Virtual Try-on Inception Distance (VTID), for a more comprehensive assessment. On the VITON-HD and DressCode datasets, our approach outperforms the previous state-of-the-art (SOTA) methods by 1.4%, 2.3%, 12.3%, and 5.8% on LPIPS, FID, KID, and the new VTID metric, respectively. Additionally, applying the generated data to downstream Clothing-Change Re-identification (CC-ReID) methods yields Rank-1 improvements of 2.5%, 1.1%, and 1.6% on the LTCC, PRCC, and VC-Clothes datasets, respectively. The code is publicly available at https://github.com/MrWhiteSmall/CSC-VTON.git.

Paper Structure

This paper contains 18 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison between our approach and the conventional baseline method. Red circles mark the baseline's inaccuracies; green circles show our method's corrections in clothing patterns, textures, boundaries, and alignment with the person's pose, demonstrating improved realism and precision.
  • Figure 2: Illustration of our framework. (a) Apart from random noise, the clothing image and the noised agnostic person are passed as input to the clothing diffusion model and the human diffusion model, respectively. The clothing diffusion model transforms the clothing into a feature that guides the denoising process of the human diffusion model. (b) The proposed CSC first extracts the attention corresponding to the prompt "upper". Then, the difference between this attention and the mask of the clothing region of interest is computed, and back-propagation yields a gradient. Finally, the gradient is fused with the predicted noise as a correction. The corrected result serves as the input for the next step.
  • Figure 3: Illustration of the new evaluation metric VTID.
  • Figure 4: Qualitative comparison on the VITON-HD dataset. Please zoom in for more details.
  • Figure 5: Illustration of the synthetic results with different scale factor values. Please zoom in for more details.
  • ...and 3 more figures
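
The CSC step described in the Figure 2 caption (compare attention with the clothing mask, take a gradient, fuse it with the predicted noise) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a one-dimensional "latent" whose softmax stands in for the attention map, a simple attract-plus-repel energy chosen for illustration, and an analytic gradient in place of back-propagation through a diffusion model. The names `energy`, `energy_grad`, `corrected_noise`, and `scale` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into a toy attention map that sums to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def energy(attn, mask):
    """Assumed energy: low when attention mass lies inside the mask."""
    inside = sum(a for a, m in zip(attn, mask) if m)
    outside = sum(a for a, m in zip(attn, mask) if not m)
    attract = 1.0 - inside  # shrinks as attention moves into the region
    repel = outside         # shrinks as attention leaves the background
    return attract + repel

def energy_grad(logits, mask):
    """Analytic gradient of energy(softmax(logits), mask) w.r.t. logits.

    The energy equals 1 + sum_i w_i * A_i with w_i = -1 inside the mask
    and +1 outside, so dE/dz_j = A_j * (w_j - sum_i w_i * A_i).
    """
    attn = softmax(logits)
    weights = [-1.0 if m else 1.0 for m in mask]
    mean_w = sum(w * a for w, a in zip(weights, attn))
    return [a * (w - mean_w) for a, w in zip(attn, weights)]

def corrected_noise(pred_noise, grad, scale):
    """Fuse the energy gradient with the predicted noise (assumed form)."""
    return [n + scale * g for n, g in zip(pred_noise, grad)]

# Toy setup: four spatial positions, the first two are "clothing".
mask = [1, 1, 0, 0]
latent = [0.0, 0.0, 0.0, 0.0]
pred_noise = [0.0, 0.0, 0.0, 0.0]  # stand-in for the denoiser output
scale = 1.0                        # hypothetical guidance scale factor

e_before = energy(softmax(latent), mask)
noise = corrected_noise(pred_noise, energy_grad(latent, mask), scale)
# Simplified denoising update: step against the (corrected) noise.
latent_next = [z - n for z, n in zip(latent, noise)]
e_after = energy(softmax(latent_next), mask)
print(e_before, e_after)
```

In this toy setting the corrected step shifts attention mass toward the masked positions, so the energy after the step is lower than before. In the actual method the gradient would come from back-propagating through the diffusion model's attention layers, and the strength of the correction is presumably what the scale factor in Figure 5 controls.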