MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input

Zhenchen Wan, Yanwu Xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong

TL;DR

This work tackles the limitations of mask-dependent Virtual Try-On (VITON) by introducing MF-VITON, a two-stage Mask-Free framework. It first uses a Mask-based VITON model to generate a high-quality, diverse training dataset, then fine-tunes a Mask-Free model with an Output-for-Input (OFI) strategy to learn garment transfer from minimal inputs. The architecture combines a Garment Extractor (ReferenceNet + Adapter) with a latent-space denoising network (TryonNet) and employs decoupled cross-attention to fuse garment features with text prompts. Across the VITON-HD, DressCode, and In-the-Wild benchmarks, MF-VITON achieves state-of-the-art realism and garment fidelity, while the OFI strategy markedly improves robustness to mask inaccuracies and background variation, enabling practical, real-world VITON applications.
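As a rough illustration of decoupled cross-attention, the sketch below follows the IP-Adapter-style formulation in which the same queries attend separately to text tokens and to garment (image) tokens, and the two results are summed. The function names, the `garment_scale` weight, and the omission of the learned key/value projections are assumptions made for brevity; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def decoupled_cross_attention(queries, text_kv, garment_kv, garment_scale=1.0):
    """Attend to text and garment tokens separately, then sum the results.

    text_kv and garment_kv are (keys, values) tuples that are assumed to be
    already projected; garment_scale is a hypothetical blending weight.
    """
    text_out = attention(queries, *text_kv)
    garment_out = attention(queries, *garment_kv)
    return [
        [t + garment_scale * g for t, g in zip(t_row, g_row)]
        for t_row, g_row in zip(text_out, garment_out)
    ]
```

Setting `garment_scale` to zero reduces the layer to ordinary text-only cross-attention, which is why this form can be added to a pretrained text-to-image backbone without disturbing its text pathway.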

Abstract

Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig. 1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance in garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark. For more details, visit our project page: https://zhenchenwan.github.io/MF-VITON/.
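The two-stage pipeline can be sketched as a data-construction routine. This is a minimal sketch under stated assumptions: images are stood in for by string identifiers, `mask_based_tryon` is a hypothetical callable wrapping a Mask-based VITON model, and the choice of the original photo as the training target is one plausible reading of the pipeline rather than a detail confirmed by the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    person: str   # identifier of a person photo
    garment: str  # identifier of the garment worn in that photo


@dataclass(frozen=True)
class MaskFreeSample:
    person_input: str  # synthesized person image wearing a different garment
    garment: str       # garment to transfer onto the person
    target: str        # assumed supervision target: the original photo


def build_mask_free_dataset(
    pairs: List[Pair],
    garment_pool: List[str],
    mask_based_tryon: Callable[[str, str], str],
) -> List[MaskFreeSample]:
    """Stage 1: synthesize garment-swapped inputs with a Mask-based model."""
    samples = []
    for pair in pairs:
        for other in garment_pool:
            if other == pair.garment:
                continue  # a swap needs a garment different from the original
            swapped = mask_based_tryon(pair.person, other)
            samples.append(
                MaskFreeSample(
                    person_input=swapped,
                    garment=pair.garment,
                    target=pair.person,
                )
            )
    return samples
```

Stage 2 would then fine-tune the pre-trained model on these samples, feeding `person_input` and `garment` with no mask and supervising against `target`.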

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We propose a Mask-Free Virtual Try-On framework that achieves SOTA visual quality by eliminating artifacts caused by inaccurate masks: (a) Eliminate interference from inaccurate masks: Inaccurate masks cause over-masking, leading to unnatural regeneration of hair or hands, and mask leakage, resulting in artifacts like remnants of the old clothing. (b) Demonstration of VITON-HD In-the-Wild.
  • Figure 2: Overview of MF-VITON: We propose a Mask-based & Mask-Free VITON pipeline that enables seamless adaptation from Mask-based to MF-VITON. The pipeline comprises two branches: (a) the Garment Extractor, which leverages ReferenceNet to encode fine-grained garment features $\mathcal{E}(x_{\text{g}})$ and employs an Adapter [ye_ip-adapter_2023] to extract high-level semantics from garment images $X_g$ using a pretrained image encoder; and (b) the Denoising Network, which utilizes TryonNet as the primary denoising branch to process concatenated inputs of noised latent $X_t$ and selectively integrates either Mask-based conditions ($\mathcal{E}(X_{\text{Masked-Con}})$, Mask-based text prompt) or Mask-Free conditions ($\mathcal{E}(X_{\text{Unmasked-Con}})$, Mask-Free text prompt).
  • Figure 3: Overview of MF-VITON Dataset Generation: (a) In-the-Wild Mask-Free Dataset Generation: Uses FLUX.1-Fill-dev [flux2024] to generate realistic background-filled model images $X_{\text{bg}}$, which are then composited with the Mask-based background $bg$ to create Mask-Free dataset samples $X_{\text{Unmasked-Con-bg}}$. (b) Mask-Free Dataset Generation: Concatenates the noised latent encoding $\mathcal{E}(X_{\text{model}})$ with Mask-based conditions $\mathcal{E}(X_{\text{Masked-Cons}})$. The Mask-based VITON model then synthesizes garment-swapped images $X_{\text{Unmasked-Con}}$.
  • Figure 4: The blue dashed box shows Over-masking, causing person appearance inconsistency, while the red dashed box indicates Mask leakage, introducing artifacts. This figure highlights our model’s superior naturalness and realism compared to SOTA approaches on (a) VITON-HD [choi_viton-hd_2021] and (b) VITON-HD In-the-Wild. All masks are generated and augmented using VITON-HD [choi_viton-hd_2021]. Zoom in for finer details.
  • Figure 5: Comparison of virtual try-on results with and without the OFI strategy. The OFI-enhanced model achieves better control over the attention map, leading to more accurate garment placement and fewer artifacts, particularly in challenging regions.
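
The compositing step mentioned in the Figure 3(a) caption is not specified in detail here. One common way to combine a person image with a new background is a per-pixel mask blend, sketched below. The function name, the use of a soft person mask, and the nested-list grayscale images are all assumptions chosen to keep the example dependency-free; the authors' actual compositing may differ.

```python
def composite(foreground, background, person_mask):
    """Blend two same-sized grayscale images given as nested lists of floats.

    A mask value of 1.0 keeps the foreground pixel, 0.0 keeps the background
    pixel, and intermediate values mix the two (soft edges).
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, person_mask)
    ]
```

Because the blend is linear in the mask, feathering the mask edges gives a smooth transition between the person and the generated background instead of a hard cut-out.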