MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input
Zhenchen Wan, Yanwu Xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
TL;DR
This work tackles the limitations of mask-dependent Virtual Try-On (VITON) by introducing MF-VITON, a two-stage Mask-Free framework. It first uses a Mask-based VITON model to generate a high-quality, diverse training dataset, and then fine-tunes a Mask-Free model with an Output-for-Input (OFI) strategy so that it learns garment transfer from minimal inputs. The architecture combines a Garment Extractor (ReferenceNet + Adapter) with a latent-space denoising network (TryonNet) and employs decoupled cross-attention to fuse garment features with text prompts. Across the VITON-HD, DressCode, and In-the-Wild benchmarks, MF-VITON achieves state-of-the-art realism and garment fidelity, while the OFI strategy markedly improves robustness to mask inaccuracies and background variation, enabling practical, real-world VITON applications.
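To make the decoupled cross-attention mentioned above concrete, the following is a minimal NumPy sketch. It assumes the common formulation in which the latent queries attend to text tokens and to garment tokens through two separate attention branches whose outputs are summed. The function names, the `garment_scale` weight, and the toy dimensions are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(q, text_kv, garment_kv, garment_scale=1.0):
    # Assumption: each condition has its own key/value pair, and the two
    # branch outputs are added. garment_scale is a hypothetical weight.
    text_out = attention(q, *text_kv)
    garment_out = attention(q, *garment_kv)
    return text_out + garment_scale * garment_out


rng = np.random.default_rng(0)
d = 8                                   # toy feature dimension
q = rng.normal(size=(16, d))            # 16 latent tokens from the denoiser
text_kv = (rng.normal(size=(4, d)), rng.normal(size=(4, d)))        # 4 text tokens
garment_kv = (rng.normal(size=(6, d)), rng.normal(size=(6, d)))     # 6 garment tokens

out = decoupled_cross_attention(q, text_kv, garment_kv, garment_scale=0.5)
print(out.shape)
```

Keeping the two branches separate means the garment contribution can be scaled or switched off without touching the text pathway: with `garment_scale=0.0` the result reduces to ordinary text cross-attention.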
Abstract
Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, which add complexity and degrade performance when the masks are imperfect, as shown in Fig. 1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance in garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark. For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.
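The first stage of the pipeline can be pictured as a small data-construction routine. The sketch below is one plausible reading of it, not the authors' code: a Mask-based model dresses the person in a different garment, and that synthesized output becomes the input of the Mask-Free model, with the original photograph as the target. The names `MaskFreeSample`, `build_sample`, and `fake_tryon` are hypothetical, images are stood in for by strings, and background augmentation is omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MaskFreeSample:
    input_person: Any   # synthesized image: the person wearing another garment
    input_garment: Any  # garment the Mask-Free model should transfer
    target: Any         # original photograph, used as ground truth


def build_sample(
    person: Any,
    garment: Any,
    other_garment: Any,
    mask_based_tryon: Callable[[Any, Any], Any],
) -> MaskFreeSample:
    # Assumption: `person` is a real photo of someone wearing `garment`.
    # The Mask-based model (hypothetical callable) swaps in `other_garment`.
    synthesized = mask_based_tryon(person, other_garment)
    # The synthesized output is reused as the input, and the real photo
    # serves as the target, so no mask is needed when fine-tuning.
    return MaskFreeSample(
        input_person=synthesized,
        input_garment=garment,
        target=person,
    )


def fake_tryon(person: str, garment: str) -> tuple:
    # Stand-in for a real Mask-based VITON model, for illustration only.
    return ("tryon", person, garment)


sample = build_sample("person.jpg", "garment_a.jpg", "garment_b.jpg", fake_tryon)
print(sample)
```

Under this reading, every training target is a real photograph, so errors in the synthesized images affect only the inputs and not the supervision signal.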
