Rethinking Garment Conditioning in Diffusion-based Virtual Try-On
Kihyun Na, Jinyoung Choi, Injung Kim
TL;DR
Rethinking Garment Conditioning in Diffusion-based Virtual Try-On identifies the substantial computational cost of dual-UNet VTON approaches and proposes Re-CatVTON, a single-UNet model guided by hypotheses on how to learn timestep-aligned garment context features. The authors combine full UNet fine-tuning, timestep-aware conditioning, garment-region loss exclusion, a modified classifier-free guidance strategy, and ground-truth garment latent injection to mitigate error accumulation. The resulting model reaches performance comparable to the state of the art on VITON-HD and competitive results on DressCode, with significantly lower compute than dual-UNet models. Through visualization and theoretical analysis, the work shows that careful conditioning design can close much of the fidelity gap between single- and dual-UNet VTON models, enabling practical deployment. The approach improves FID, KID, and LPIPS with only a marginal SSIM decrease, offering a practical efficiency–performance balance for diffusion-based virtual try-on systems.
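To make the idea of garment-region loss exclusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes the person and garment latents are concatenated along the width axis, with the person occupying the first `person_width` columns. The function name `person_region_mse`, the array shapes, and the layout are all hypothetical choices made for illustration.

```python
import numpy as np

def person_region_mse(pred_noise, true_noise, person_width):
    """Mean squared error over the person region only.

    Assumed layout (hypothetical): arrays are (channels, height, width),
    the person occupies columns [0, person_width), and the garment
    occupies the remaining columns, which are excluded from the loss.
    """
    diff = pred_noise[..., :person_width] - true_noise[..., :person_width]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
true_noise = rng.standard_normal((4, 8, 16))

# A prediction that is perfect on the person half and badly wrong on the
# garment half: the excluded region does not contribute to the loss.
pred_noise = true_noise.copy()
pred_noise[..., 8:] += 5.0

loss = person_region_mse(pred_noise, true_noise, person_width=8)
```

Under this assumed layout, errors in the garment columns leave `loss` unchanged, so the training signal comes only from the try-on region.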
Abstract
Virtual Try-On (VTON) is the task of synthesizing an image of a person wearing a target garment, conditioned on a person image and a garment image. While diffusion-based VTON models with a dual-UNet architecture achieve higher fidelity than single-UNet models, their heavy structure incurs substantial computational and memory overhead. In this study, through visualization and theoretical analysis, we derive three hypotheses regarding the learning of context features that condition the denoising process. Based on these hypotheses, we develop Re-CatVTON, an efficient single-UNet model that achieves high performance. We further enhance the model by introducing a modified classifier-free guidance strategy tailored to VTON's spatial concatenation conditioning, and by directly injecting the ground-truth garment latent, derived from the clean garment latent, to prevent the accumulation of prediction error. The proposed Re-CatVTON significantly improves performance over its predecessor, CatVTON, and requires less computation and memory than the high-performance dual-UNet model Leffa. Our results show improved FID, KID, and LPIPS scores with only a marginal decrease in SSIM, establishing a new efficiency-performance trade-off for single-UNet VTON models.
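The ground-truth garment latent injection described above can be illustrated with a short sketch. This is one plausible reading, not the authors' code. It assumes that at each denoising step the garment region of the spatially concatenated latent is overwritten with a forward-diffused copy of the clean garment latent, using the standard DDPM forward process. The names `q_sample` and `inject_garment`, the width-wise layout, and the choice to noise the garment latent to the current timestep are all assumptions.

```python
import numpy as np

def q_sample(clean, alpha_bar_t, noise):
    """Standard DDPM forward process: noise a clean latent to timestep t."""
    return np.sqrt(alpha_bar_t) * clean + np.sqrt(1.0 - alpha_bar_t) * noise

def inject_garment(latent, clean_garment, alpha_bar_t, noise, person_width):
    """Overwrite the garment region with a latent derived from the clean one.

    Assumed layout (hypothetical): `latent` is (channels, height, width),
    the person occupies columns [0, person_width), and the garment occupies
    the remaining columns. The model's own garment prediction is discarded,
    so its errors cannot accumulate across denoising steps.
    """
    out = latent.copy()
    out[..., person_width:] = q_sample(clean_garment, alpha_bar_t, noise)
    return out

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 16))        # current concatenated latent
clean_garment = rng.standard_normal((4, 8, 8))  # encoded garment image
noise = rng.standard_normal((4, 8, 8))

# With alpha_bar_t = 1.0 (no noise) the garment region equals the clean latent.
injected = inject_garment(latent, clean_garment, 1.0, noise, person_width=8)
```

In a sampling loop, a call like this would run before each UNet forward pass, so the garment region always reflects the reference garment rather than the previous step's prediction.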
