DS-VTON: An Enhanced Dual-Scale Coarse-to-Fine Framework for Virtual Try-On

Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

TL;DR

DS-VTON tackles the dual challenges of garment-body alignment and texture fidelity in virtual try-on by introducing a dual-scale coarse-to-fine diffusion framework. It decouples structure guidance in a low-resolution stage from high-resolution texture refinement via a novel blend-refine diffusion that connects two complex distributions, using a mask-free training regime and a dual-U-Net backbone. The approach achieves state-of-the-art results on VITON-HD and DressCode, with robust qualitative and quantitative improvements and strong ablations supporting the design choices. Its mask-free, scalable framework offers practical benefits for e-commerce while reducing reliance on segmentation masks, and it has potential for extension to higher resolutions and related image synthesis tasks.

Abstract

Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. These two requirements map directly onto a coarse-to-fine generation paradigm, where the coarse stage handles structural alignment and the fine stage recovers rich garment details. Motivated by this observation, we propose DS-VTON, an enhanced dual-scale coarse-to-fine framework that tackles the try-on problem more effectively. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. In the second stage, a blend-refine diffusion process reconstructs high-resolution outputs by refining the residual between scales through noise-image blending, emphasizing texture fidelity and effectively correcting fine-detail errors from the low-resolution stage. In addition, our method adopts a fully mask-free generation strategy, eliminating reliance on human parsing maps or segmentation masks. Extensive experiments show that DS-VTON not only achieves state-of-the-art performance but consistently and significantly surpasses prior methods in both structural alignment and texture fidelity across multiple standard virtual try-on benchmarks.

Paper Structure

This paper contains 43 sections, 3 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: (a) DS-VTON results across diverse scenarios. (b) Existing methods [kim2024stableviton, xu2025ootdiffusion, choi2025improving, zhou2025flowattention] adopt a single-scale pipeline with masked inputs, limiting their ability to capture full-body semantics and garment structure. (c) In contrast, DS-VTON adopts an enhanced dual-scale coarse-to-fine framework combined with a mask-free strategy.
  • Figure 2: Upper panel: Two-scale generation pipeline. A low-resolution stage produces a coarse try-on result, which is then refined by a high-resolution stage; both stages share the same network architecture (see Section \ref{sec:method}). Lower panel: Results with different settings; ours uses $\sigma = 2$ and $\alpha = \beta = \tfrac{1}{2}$ (see Subsections \ref{subsec:low_res} and \ref{subsec:high_res}). With proper two-stage settings, the second stage leverages the reliable coarse structure from the first stage to correct fine-detail errors and generate high-quality try-on results.
  • Figure 3: Qualitative comparison on the VITON-HD dataset. DS-VTON(LR) denotes the low-resolution result, and DS-VTON(HR) represents the final high-resolution result.
  • Figure 4: Visualized results under varying downsampling ratios $\sigma$.
  • Figure 5: Visualized results under different $\mathbf{x}_T$ initialization settings ($\mathbf{x}_T = \alpha \cdot \boldsymbol{\epsilon} + \beta \cdot \tilde{\mathbf{x}}_r$).
  • ...and 7 more figures
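
The initialization in the Figure 5 caption, $\mathbf{x}_T = \alpha \cdot \boldsymbol{\epsilon} + \beta \cdot \tilde{\mathbf{x}}_r$, can be illustrated with a small NumPy sketch. Several assumptions are made here, none of which are confirmed by the text above: $\tilde{\mathbf{x}}_r$ is taken to be the low-resolution result upsampled by the ratio $\sigma$, nearest-neighbor repetition stands in for whatever upsampling the authors use, and the blend is applied to a plain array rather than a learned latent. The function names `upsample_nearest` and `blend_init` are hypothetical.

```python
import numpy as np


def upsample_nearest(x, sigma):
    """Upsample an (H, W, C) array by an integer factor via pixel repetition.

    Assumption: a stand-in for the paper's actual upsampling operator.
    """
    return np.repeat(np.repeat(x, sigma, axis=0), sigma, axis=1)


def blend_init(x_low, sigma=2, alpha=0.5, beta=0.5, rng=None):
    """Form x_T = alpha * eps + beta * x_r for the high-resolution stage.

    x_low : (H, W, C) array, the coarse try-on result (assumed input).
    sigma : integer ratio between the two scales (the caption uses 2).
    alpha, beta : blend weights (the caption uses 1/2 for both).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_r = upsample_nearest(x_low, sigma)   # assumed meaning of x_r (tilde)
    eps = rng.standard_normal(x_r.shape)   # Gaussian noise, same shape as x_r
    return alpha * eps + beta * x_r


if __name__ == "__main__":
    coarse = np.zeros((4, 3, 3))           # toy stand-in for a coarse result
    x_T = blend_init(coarse, rng=np.random.default_rng(0))
    print(x_T.shape)
```

Setting `alpha=0` reduces the start point to the upsampled coarse result alone, while `beta=0` reduces it to pure noise, as in a standard diffusion start. The reported setting $\alpha = \beta = \tfrac{1}{2}$ sits between the two.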