DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On
Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
TL;DR
DS-VTON addresses the two core challenges of virtual try-on, garment-body alignment and texture fidelity, with a dual-scale coarse-to-fine diffusion framework. A low-resolution stage provides structure guidance, and a high-resolution stage refines texture through a novel blend-refine diffusion process that connects two complex distributions. The framework is trained mask-free and uses a dual-U-Net backbone. It achieves state-of-the-art results on VITON-HD and DressCode, with qualitative and quantitative improvements and ablations that support the design choices. Because it does not rely on segmentation masks, the framework is practical for e-commerce, and it could be extended to higher resolutions and related image synthesis tasks.
Abstract
Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining the residual between scales through noise-image blending, emphasizing texture fidelity and effectively correcting fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON not only achieves state-of-the-art performance but consistently and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.
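The hand-off between the two stages can be illustrated with a minimal sketch. The abstract says only that the second stage refines the residual between scales through noise-image blending; it does not give the blending rule. The linear mix below, the coefficient `alpha`, the nearest-neighbour upsampling, and every function name are therefore assumptions made for illustration, not the authors' implementation.

```python
import random


def upsample_nearest(image, factor):
    """Nearest-neighbour upsampling of a 2D list-of-lists image."""
    rows = []
    for row in image:
        widened = [value for value in row for _ in range(factor)]
        rows.extend(list(widened) for _ in range(factor))
    return rows


def blend_with_noise(image, alpha, rng):
    """Hypothetical blending rule: alpha * image + (1 - alpha) * Gaussian noise."""
    return [
        [alpha * value + (1.0 - alpha) * rng.gauss(0.0, 1.0) for value in row]
        for row in image
    ]


def second_stage_start(coarse, factor=2, alpha=0.5, seed=0):
    """Build an assumed starting point for the high-resolution stage.

    The coarse try-on result is upsampled to the target size and then mixed
    with noise, so refinement starts from a structurally aligned image
    instead of from pure noise.
    """
    rng = random.Random(seed)
    return blend_with_noise(upsample_nearest(coarse, factor), alpha, rng)


# Toy 2x2 "low-resolution try-on result" standing in for the first-stage output.
coarse = [[0.0, 1.0], [1.0, 0.0]]
start = second_stage_start(coarse, factor=2, alpha=0.5, seed=0)
print(len(start), len(start[0]))  # prints: 4 4
```

In this sketch, `alpha = 1.0` reproduces the upsampled coarse image exactly and `alpha = 0.0` yields pure noise. Intermediate values trade the structure inherited from the first stage against the freedom the second stage has to regenerate texture.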
