Rethinking Garment Conditioning in Diffusion-based Virtual Try-On

Kihyun Na, Jinyoung Choi, Injung Kim

TL;DR

This paper identifies the substantial computational cost of dual-UNet VTON approaches and proposes Re-CatVTON, a single-UNet model guided by three hypotheses on how to learn timestep-aligned garment context features. The authors combine full UNet fine-tuning, timestep-aware conditioning, garment-region loss exclusion, a modified classifier-free guidance strategy, and ground-truth garment latent injection (to limit the accumulation of prediction error). The resulting model reaches near state-of-the-art performance on VITON-HD and competitive results on DressCode with significantly lower compute than dual-UNet models. Through visualization and theoretical analysis, the work shows that careful conditioning design can close much of the fidelity gap between single- and dual-UNet VTON models, which makes deployment more practical. Compared with CatVTON, the approach improves FID, KID, and LPIPS with only a marginal decrease in SSIM, offering a favorable efficiency-performance balance for diffusion-based virtual try-on systems.

Abstract

Virtual Try-On (VTON) is the task of synthesizing an image of a person wearing a target garment, conditioned on a person image and a garment image. While diffusion-based VTON models featuring a Dual UNet architecture demonstrate superior fidelity compared to single UNet models, they incur substantial computational and memory overhead due to their heavy structure. In this study, through visualization analysis and theoretical analysis, we derived three hypotheses regarding the learning of context features to condition the denoising process. Based on these hypotheses, we developed Re-CatVTON, an efficient single UNet model that achieves high performance. We further enhance the model by introducing a modified classifier-free guidance strategy tailored for VTON's spatial concatenation conditioning, and by directly injecting the ground-truth garment latent derived from the clean garment latent to prevent the accumulation of prediction error. The proposed Re-CatVTON significantly improves performance compared to its predecessor (CatVTON) and requires less computation and memory than the high-performance Dual UNet model, Leffa. Our results demonstrate improved FID, KID, and LPIPS scores, with only a marginal decrease in SSIM, establishing a new efficiency-performance trade-off for single UNet VTON models.
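
The ground-truth garment latent injection mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that the person and garment latents are concatenated along the width axis with the garment half last, and that at each sampling step the garment half is overwritten with the clean garment latent noised to the current timestep using the standard DDPM forward formula. The names `forward_noise`, `inject_garment`, and `alpha_bar_t` are hypothetical.

```python
import numpy as np

def forward_noise(clean, alpha_bar_t, rng):
    """Standard DDPM forward noising: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = rng.standard_normal(clean.shape)
    return np.sqrt(alpha_bar_t) * clean + np.sqrt(1.0 - alpha_bar_t) * eps

def inject_garment(latent, clean_garment, alpha_bar_t, rng):
    """Overwrite the garment half of a width-concatenated latent.

    Assumption: layout is (C, H, W_person + W_garment), garment half last.
    The person half is left exactly as the sampler produced it.
    """
    out = latent.copy()
    w = clean_garment.shape[-1]
    out[..., -w:] = forward_noise(clean_garment, alpha_bar_t, rng)
    return out

rng = np.random.default_rng(0)
clean_garment = rng.standard_normal((4, 8, 6))   # toy clean garment latent
latent = rng.standard_normal((4, 8, 12))         # toy sampler output (person + garment)

# With alpha_bar_t = 1.0 (no noise left) the garment half becomes the clean latent.
injected = inject_garment(latent, clean_garment, alpha_bar_t=1.0, rng=rng)
```

Because the garment half is recomputed from the clean latent at every step instead of being carried over from the previous prediction, any error the network makes in that region cannot build up across steps, which is the motivation the abstract gives.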

Paper Structure

This paper contains 27 sections, 10 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Try-on images generated by our Re-CatVTON on the DressCode dataset. Please zoom in for more details.
  • Figure 2: Structural comparison of different try-on methods. Dual diffusion-based approaches employ a separate reference network to guide the try-on network, whereas single diffusion-based models use only one network to reconstruct both person and garment. Our Re-CatVTON follows the single diffusion-based design but excludes the garment region from the loss calculation, so that the garment region serves purely as a contextual feature provider.
  • Figure 3: Visualization of the predicted noise $\boldsymbol{\epsilon}^g_t$ from the reference UNets across key timesteps ($t{=}1000, 500, 1$). IDM-VTON shows severe timestep inconsistency, producing strong garment features only at early steps while collapsing into noise near $t{=}1$. OOTDiffusion generates strong but timestep-invariant features, indicating that its reference UNet does not respond to diffusion progress. Leffa, although not fully stable across timesteps, reflects the garment structure more consistently than the baselines and exhibits clearer timestep-dependent behavior.
  • Figure 4: Architecture of Re-CatVTON. The masked-person and garment images are encoded into disentangled VAE latents, which are spatially fused to construct a time-aligned prior crucial for accurate garment–body correspondence. This prior is progressively refined by the diffusion UNet, where each component contributes to stabilizing the denoising trajectory. The sampler converts the predicted noise into updated latents, and the final latent is decoded to obtain the try-on output.
  • Figure 5: Qualitative comparison of VTON models on VITON-HD. (a) While CatVTON often introduces shape distortion or loses fine details, Re-CatVTON generates clearer structure and more faithful garment shapes. Compared with Leffa, our results are of comparable quality, with improvements in some cases such as sharper texture preservation and more stable alignment between the garment and the body. (b) In challenging examples involving logos and text, CatVTON frequently blurs or warps the patterns. Leffa and our method both reproduce these designs well, but Re-CatVTON retains slightly clearer edges and more consistent typography.
  • ...and 5 more figures
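
The garment-region loss exclusion described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's training code. It assumes the person and garment latents are concatenated along the width axis with the person half first, and that the loss is a mean squared error on predicted noise. The function name `person_region_mse` and the toy shapes are hypothetical.

```python
import numpy as np

def person_region_mse(pred_noise, true_noise, person_width):
    """Noise-prediction MSE computed over the person half only.

    Assumption: tensors have layout (C, H, W_person + W_garment), person first.
    The garment half is sliced away, so it contributes nothing to the loss.
    """
    diff = pred_noise[..., :person_width] - true_noise[..., :person_width]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
person = rng.standard_normal((4, 8, 6))    # toy person latent
garment = rng.standard_normal((4, 8, 6))   # toy garment latent

# Spatial concatenation conditioning: one latent holding both halves.
latent = np.concatenate([person, garment], axis=-1)   # shape (4, 8, 12)

true_noise = rng.standard_normal(latent.shape)
pred_noise = true_noise.copy()
pred_noise[..., 6:] += 5.0   # large error placed in the garment half only

loss = person_region_mse(pred_noise, true_noise, person_width=6)
```

In this toy setup the prediction is wrong only in the garment half, so the loss is zero: the network is never penalized for how it reconstructs the garment, which leaves that region free to act as context for the person half.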