Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling

Nannan Li, Kevin J. Shih, Bryan A. Plummer

TL;DR

This paper tackles two key problems in virtual try-on: scarce paired training data and texture distortions in generated garments. It introduces a human-to-garment (H2G) model that synthesizes (human, synthetic garment) pairs from single images of clothed people, enabling data augmentation without copyright issues, and an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that locally refines artifacts by using a weakly-supervised error map to adapt the diffusion noise schedule. On VITON-HD and DressCode-Upper, the synthetic data improves the performance of prior methods and EARSB improves overall image quality, including textures and text graphics; in user studies, the model is preferred in an average of 59% of cases. The localized refinement approach may also apply to other conditional image generation tasks that require region-specific corrections.

Abstract

Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match those of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work explores ways to tackle these issues through both synthetic data and model refinement. We introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on models. We also propose an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that surgically targets localized generation errors to correct the output of a base virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, and we then augment the Schrödinger Bridge's noise schedule with the classifier's confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB improves the overall image quality. In user studies, our model is preferred by users in an average of 59% of cases.
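
The idea of steering noise with a confidence heatmap can be illustrated with a toy example. The sketch below is not the paper's formulation: the function names (`reweight_noise`, `perturb`), the simple per-pixel multiplication, and the `sigma` parameter are all assumptions made for illustration. It only shows the general effect that pixels with high error confidence receive noise (and can therefore be regenerated), while pixels judged correct are left untouched.

```python
import random

def reweight_noise(eps, error_map):
    """Scale each noise sample by the error confidence (clamped to [0, 1])
    at the same pixel. Hypothetical rule, assumed for illustration only."""
    return [[min(max(m, 0.0), 1.0) * e for e, m in zip(eps_row, map_row)]
            for eps_row, map_row in zip(eps, error_map)]

def perturb(x1, error_map, sigma=0.5, seed=0):
    """Add error-weighted Gaussian noise to an initial image x1.
    x1 and error_map are 2D lists of floats with the same shape."""
    rng = random.Random(seed)
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in x1]
    eps_r = reweight_noise(eps, error_map)
    return [[x + sigma * e for x, e in zip(x_row, e_row)]
            for x_row, e_row in zip(x1, eps_r)]

# Toy 2x2 grayscale "image": only the top-right pixel is flagged as an error.
x1 = [[0.2, 0.4], [0.6, 0.8]]
error_map = [[0.0, 1.0], [0.0, 0.0]]
noised = perturb(x1, error_map)
```

In this toy setting, only the flagged pixel changes; a real refinement model would then denoise the perturbed image rather than stop here.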

Paper Structure (41 sections, 6 equations, 15 figures, 9 tables)


Figures (15)

  • Figure 1: Example of our proposed Error-Aware Refinement Schrödinger Bridge (EARSB). EARSB can refine the artifacts (marked by bounding boxes) in an initial image generated by an existing try-on model. The initial image is generated by StableVITON in the top row and by SD-VITON in the bottom row. The "+ Syn. Data" result in the last column strengthens the refinement by adding the proposed synthetic data augmentation during training.
  • Figure 2: (a) Our human-to-garment model, which is explained in \ref{sec:h2g}. (b) Examples of the constructed (human, synthetic garment) pairs in \ref{sec:h2g}.
  • Figure 3: The diffusion process in our refinement-based EARSB. We first preprocess the input image, then use a base try-on model that takes the masked human image $\bar{x}_0$, its pose representation $P$, and its garment $C$ as input to generate an initial human image $x_1$. $x_1$ is fed to our weakly-supervised classifier (WSC) to obtain the error map $M$ (see \ref{sec:classifier}). This map reweights the noise distribution $\epsilon$ to $\epsilon^r$ in I$^2$SB diffusion and refines $x_1$ that has generation errors to the ground truth image $x_0$ (see \ref{sec:pipeline}).
  • Figure 4: Results on VITON-HD at 5, 10, 25, 50, and 100 sampling steps. Our method consistently improves on our baseline starting model GP-VTON (black, dotted line), making it competitive with StableVITON, especially at fewer than 50 sampling steps. Legend is shared for all.
  • Figure 5: Visualizations on VITON-HD (top row) and DressCode (bottom row). Our EARSB+H2G-UH and EARSB(SD)+H2G-UH better recover the intricate textures in the garment.
  • ...and 10 more figures