MFP-VTON: Enhancing Mask-Free Person-to-Person Virtual Try-On via Diffusion Transformer

Le Shen, Yanting Kang, Rong Huang, Zhijie Wang

TL;DR

The paper addresses the scarcity of person-to-person virtual try-on data and proposes MFP-VTON, a mask-free framework built on a diffusion transformer. The model takes the reference person and the target person as concatenated inputs, together with a blank region that it fills in by inpainting. A Focus Attention loss steers the model's attention toward the garment on the reference person and the details outside the garment on the target person, improving fidelity and detail preservation. The training dataset is built by swapping garments between paired garment-to-person images, which supports training and evaluation on both person-to-person and garment-to-person tasks. Experiments show state-of-the-art or competitive performance across multiple metrics, with high-fidelity try-on outputs and preserved poses, supporting the practical viability of mask-free person-to-person VTON.

Abstract

The garment-to-person virtual try-on (VTON) task, which aims to generate fitting images of a person wearing a reference garment, has made significant strides. However, obtaining a standard garment is often more challenging than using the garment already worn by the person. To improve ease of use, we propose MFP-VTON, a Mask-Free framework for Person-to-Person VTON. Recognizing the scarcity of person-to-person data, we adapt a garment-to-person model and dataset to construct a specialized dataset for this task. Our approach builds upon a pretrained diffusion transformer, leveraging its strong generative capabilities. During mask-free model fine-tuning, we introduce a Focus Attention loss to emphasize the garment of the reference person and the details outside the garment of the target person. Experimental results demonstrate that our model excels in both person-to-person and garment-to-person VTON tasks, generating high-fidelity fitting images.
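
The summary describes the model input as a concatenation of the reference person, the target person, and a blank region to be inpainted. The following is a minimal sketch of one way such an input could be laid out. It is not the authors' implementation: the function name `build_tryon_canvas`, the side-by-side (width-wise) layout, the panel order, and the zero fill value are all assumptions made for illustration. The region indicator returned here only marks the blank panel; it is not a garment mask on the target person, which is what we take "mask-free" to refer to.

```python
import numpy as np


def build_tryon_canvas(reference, target, blank_value=0.0):
    """Hypothetical helper: place reference, target, and a blank panel side by side.

    reference, target: arrays of shape (H, W, C) with identical shapes.
    Returns the (H, 3W, C) canvas and an (H, 3W) indicator that is 1 only
    over the blank panel the model is expected to fill in.
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must share the same shape")

    # Blank panel with the same shape and dtype as the inputs (assumed fill: zeros).
    blank = np.full_like(reference, blank_value)

    # Assumed layout: [reference | target | blank], concatenated along the width axis.
    canvas = np.concatenate([reference, target, blank], axis=1)

    # Indicator for the region to generate; covers only the last panel.
    height, width = reference.shape[:2]
    generate_region = np.zeros((height, 3 * width), dtype=np.float32)
    generate_region[:, 2 * width:] = 1.0
    return canvas, generate_region


# Usage with small dummy images standing in for the two person photos.
reference_person = np.ones((4, 3, 3), dtype=np.float32)
target_person = np.full((4, 3, 3), 0.5, dtype=np.float32)
canvas, generate_region = build_tryon_canvas(reference_person, target_person)
```

In a real pipeline the concatenation would more likely happen on latent tokens than on raw pixels; the pixel-space version is used here only because it is easy to inspect.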

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Person-to-person and garment-to-person try-on outcomes generated by our MFP-VTON.
  • Figure 2: Overview of our proposed method. The upper part illustrates the data preparation process for the person-to-person task. The lower part demonstrates the training and inference pipelines.
  • Figure 3: Qualitative comparison. The first two columns show the inputs to different models. In the person-to-person task, the three garment-to-person methods rely on segmentation and try-off techniques to obtain the garment on the reference person. In contrast, our method directly generates the outputs based on the reference person.