UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian

TL;DR

UniFit targets universal image-based virtual try-on (VTON). It bridges the semantic gap between textual instructions and reference images with an MLLM-Guided Semantic Alignment Module (MGSA), and it uses a two-stage progressive training strategy with self-synthesis to learn complex tasks from limited data. MGSA is combined with a Diffusion Transformer (DiT) and a VAE encoder, and training is guided by a semantic alignment loss $\mathcal{L}_{\text{align}}$ and a spatial attention focusing loss $\mathcal{L}_{\text{focus}}$. Capabilities are expanded progressively from single-garment to multi-garment and model-to-model try-on. The approach reports state-of-the-art results across six VTON tasks on diverse datasets while remaining efficient, and the code and pretrained models are public. By enabling flexible, text-driven control over intricate garment transfers and multi-view/multi-garment configurations, the work has practical relevance for e-commerce and digital fashion applications.

Abstract

Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) a semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, a two-stage progressive training strategy with a self-synthesis pipeline allows UniFit to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.
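
The abstract names a semantic alignment loss applied to the outputs of MGSA's learnable queries but does not give its form here. As a rough illustration only, the sketch below assumes a cosine-distance loss between each query output and a paired target embedding. The function names, the choice of cosine distance, and the notion of a "target embedding" are all assumptions made for this example and may differ from what UniFit actually uses.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(query_outputs: List[Vector], targets: List[Vector]) -> float:
    """Hypothetical alignment loss: mean (1 - cosine similarity) over pairs.

    ASSUMPTION: query_outputs[i] is the i-th learnable-query output and
    targets[i] is the embedding it should agree with. The source does not
    specify either the pairing or the distance function.
    """
    if not query_outputs or len(query_outputs) != len(targets):
        raise ValueError("expected two non-empty lists of equal length")
    total = sum(1.0 - cosine_similarity(q, t) for q, t in zip(query_outputs, targets))
    return total / len(query_outputs)


# Vectors pointing the same way give a loss near 0; orthogonal ones give 1.
aligned = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
print(aligned, orthogonal)
```

Under this assumed form the loss depends only on direction, not magnitude, so it pulls the query outputs toward the targets without constraining their scale.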

Paper Structure

This paper contains 43 sections, 4 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Motivation of UniFit: (a) Existing instruction-guided VTON methods process text and images separately, resulting in a semantic gap. (b) Our UniFit introduces an MLLM-Guided Semantic Alignment Module (MGSA), which integrates textual and visual inputs to produce coherent and explicit semantic guidance for the generative model, effectively bridging the semantic gap.
  • Figure 2: Overview of UniFit. UniFit consists of three main components: the MGSA module (red), the DiT (gray), and the VAE encoder (blue). The MGSA encodes multimodal inputs into coherent semantic guidance. The VAE extracts low-level visual features from reference images. The DiT generates the output image conditioned on the semantic guidance and low-level visual features. Additionally, a spatial attention focusing loss (green) supervises the attention maps of the DiT, encouraging the model to focus on the most task-relevant regions (e.g., the try-on area in the single-garment try-on task).
  • Figure 3: Towards Universal VTON: Two-Stage Progressive Training Strategy of UniFit with Self-Synthesis. (a) Stage I: A Base Model is trained on foundational tasks using public datasets. (b) Self-Synthesis: The trained model is then used to generate pseudo-paired data for complex scenarios. (b1) For multi-garment try-on, we reconstruct garments from full-body images. (b2) For model-to-model try-on, we synthesize new person images conditioned on given garments. (c) Stage II: The model is fine-tuned on a composite dataset of both real and synthesized samples, enabling generalization to a wide range of VTON tasks.
  • Figure 4: Qualitative comparison of garment reconstruction.
  • Figure 5: Qualitative comparison of single-garment try-on.
  • ...and 14 more figures
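
The Figure 2 caption says the spatial attention focusing loss supervises the DiT's attention maps so that the model attends to task-relevant regions such as the try-on area. A minimal sketch of one way such a loss could be written follows, assuming it penalizes the fraction of attention mass that falls outside a binary region mask. The function name `focus_loss`, the mask input, and this particular formula are illustrative assumptions, not the definition used in the paper.

```python
from typing import List

Grid = List[List[float]]


def focus_loss(attention: Grid, mask: Grid, eps: float = 1e-8) -> float:
    """Hypothetical focusing loss: 1 - (attention mass inside mask / total mass).

    ASSUMPTION: `attention` is a non-negative 2D attention map and `mask`
    marks the task-relevant region with 1.0 (everything else 0.0). The
    source does not state how the supervision is computed.
    """
    same_shape = len(attention) == len(mask) and all(
        len(row_a) == len(row_m) for row_a, row_m in zip(attention, mask)
    )
    if not same_shape:
        raise ValueError("attention and mask must have the same shape")
    total = sum(sum(row) for row in attention)
    inside = sum(
        a * m
        for row_a, row_m in zip(attention, mask)
        for a, m in zip(row_a, row_m)
    )
    # eps keeps the division defined when the attention map is all zeros.
    return 1.0 - inside / (total + eps)


# Toy 2x2 map: the mask selects the left column as the "try-on area".
mask = [[1.0, 0.0], [1.0, 0.0]]
focused = [[0.5, 0.0], [0.5, 0.0]]   # all attention inside the region
diffuse = [[0.25, 0.25], [0.25, 0.25]]  # half of the attention outside
print(focus_loss(focused, mask), focus_loss(diffuse, mask))
```

In this toy setting the focused map yields a loss close to 0 and the diffuse map a loss close to 0.5, so minimizing the term would push attention toward the masked region.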