Style-Instructed Mask-Free Virtual Try On

Mengqi Zhang, Qi Li, Mehmet Saygin Seyfioglu, Karim Bouyarmane

Abstract

Virtual Try-On is a promising research area with broad applications in e-commerce and everyday life, enabling users to visualize garments on themselves or others before purchase. Most existing methods depend on predefined or user-specified masks to guide garment placement, but their performance is highly sensitive to mask quality, which often causes misalignment or artifacts and adds redundant steps for users. To overcome these limitations, we propose a mask-free virtual try-on framework that requires only minimal modifications to the underlying architecture while remaining compatible with common diffusion-based pipelines. To address the increased ambiguity in the absence of masks, we integrate an attention-based guidance mechanism that explicitly directs the model to focus on the target garment region and improves correspondence between the garment and the person. Additionally, we incorporate instruction prompts that let users flexibly control garment categories and wearing styles, addressing the underutilization of prompts in prior work and improving interaction flexibility. Both qualitative and quantitative evaluations across multiple datasets demonstrate that our approach consistently outperforms existing methods, producing more accurate, robust, and user-friendly try-on results.
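
The attention-based guidance is only described at a high level in the abstract. As a rough, non-authoritative sketch of how such a mechanism can be trained, the PyTorch snippet below implements an auxiliary loss that pushes the attention mass flowing from person-image tokens to garment-reference tokens toward a garment-region map. The function name, tensor layout, and the availability of a binary garment map at training time are all assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(attn: torch.Tensor,
                            garment_region: torch.Tensor) -> torch.Tensor:
    """Illustrative attention-guidance loss (assumed form, not the paper's exact L_attn).

    attn:           (batch, heads, person_tokens, garment_tokens) attention
                    weights from person-image queries to the slice of keys
                    belonging to the garment reference (softmax already applied
                    over all keys, so each row sums to at most 1).
    garment_region: (batch, person_tokens) binary map of where the garment
                    should appear on the person, assumed available at training
                    time from the synthetic data pipeline.
    """
    # Total attention mass each person token sends to the garment reference,
    # averaged over heads -> (batch, person_tokens), values in [0, 1].
    mass = attn.sum(dim=-1).mean(dim=1).clamp(1e-6, 1.0 - 1e-6)
    # Encourage high garment attention inside the region, low outside.
    return F.binary_cross_entropy(mass, garment_region.float())
```

In a setup like this, the term would be added to the diffusion objective with a small weight, e.g. $L = L_{\text{diff}} + \lambda\, L_{\text{attn}}$; the weighting and exact formulation here are likewise assumptions.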

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: We propose a framework that enables high-fidelity garment transfer and fine-grained stylistic control, maintaining visual consistency across diverse human identities, complex poses, and a broad spectrum of clothing categories.
  • Figure 2: Overview of the proposed Style-Instructed Mask-Free Virtual Try-On (SMF-VTO). Left: Synthetic data generation. A pretrained mask-based VTO model synthesizes a source image by combining a target person image with a sampled alternate outfit, producing source–target pairs that enable triplet-style supervision without manual segmentation (a procedural sketch of this pipeline follows the figure list). Right: SMF-VTO architecture. Person/garment images are encoded by a VAE, while textual style instructions are encoded by a text encoder; the resulting tokens are fused and processed by DiT blocks. We further apply an attention guidance loss, $L_{\text{attn}}$, to encourage the model’s internal attention to concentrate on garment regions, improving spatial fidelity and controllability in a fully mask-free setting.
  • Figure 3: Text-instructed style control results of SMF-VTO. Given the same source person image and pose, SMF-VTO generates different try-on results guided by natural language prompts describing garment type and wearing style (e.g., "wear this t-shirt tucked in the pants", "try on this dress with elbow-length sleeves"). The results illustrate that SMF-VTO can follow fine-grained instructions while maintaining realistic garment transfer and spatial alignment, without any segmentation masks.
  • Figure 4: Ablation study on the key components. From left to right: source person image, reference garment, baseline Kontext model, Kontext with reference positional embedding (+ Ref Pos Emb), and full model with attention-guided mask loss (+ Attn Mask Loss). The baseline struggles with incomplete garment transfer and spatial misalignment. Adding reference positional embedding improves placement and structure, while incorporating the attention-guided auxiliary loss further enhances garment detail, contour fidelity, and texture realism, demonstrating the complementary effects of both modules.
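
Figure 2's synthetic data generation (left panel) can also be summarized procedurally. The sketch below is a minimal outline under stated assumptions: `mask_based_vto` stands in for the pretrained mask-based model, dataset pairs of (person image, garment flat) are assumed as in standard VTO corpora, and all names are hypothetical rather than the paper's implementation.

```python
import random
from typing import Callable, Sequence, TypeVar

ImageT = TypeVar("ImageT")  # any image representation (tensor, PIL image, ...)

def build_training_triplets(
    pairs: Sequence[tuple[ImageT, ImageT]],              # (person image, its garment flat)
    outfit_bank: Sequence[ImageT],                       # pool of alternate garments
    mask_based_vto: Callable[[ImageT, ImageT], ImageT],  # pretrained mask-based model
    seed: int = 0,
) -> list[tuple[ImageT, ImageT, ImageT]]:
    """Hypothetical outline of the synthetic data generation in Figure 2.

    Each person image is re-dressed in a randomly sampled alternate outfit by
    a pretrained mask-based VTO model; the synthesized image becomes the
    mask-free 'source', while the original image and its garment serve as the
    ground-truth target pair.
    """
    rng = random.Random(seed)
    triplets = []
    for person, garment in pairs:
        alternate = rng.choice(outfit_bank)           # sampled alternate outfit
        source = mask_based_vto(person, alternate)    # person wearing the alternate outfit
        triplets.append((source, garment, person))    # (source, target garment, target)
    return triplets
```

As the caption notes, this yields triplet-style supervision without manual segmentation: masks are consumed only offline by the pretrained mask-based model, never by the mask-free model being trained.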