MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer

Junsheng Luan, Guangyuan Li, Lei Zhao, Wei Xing

TL;DR

MC-VTON presents a DiT-based virtual try-on approach that achieves high-fidelity garment transfer with minimal conditioning and no extra networks. It employs a two-stage training regime with LoRA-based parameter efficiency and latent-space distillation to reduce inference steps to 8, while maintaining or improving realism. The method achieves superior quantitative and qualitative results on VITON-HD and DressCode with fewer trainable parameters and inputs. This work offers a practical, scalable pathway for real-world VTON applications and lays groundwork for extending to video-based scenarios.

Abstract

Virtual try-on methods based on diffusion models achieve realistic try-on effects. However, they use an extra reference network or an additional image encoder to process multiple conditional image inputs, which adds pre-processing complexity and computational cost. They also require more than 25 inference steps, which lengthens inference time. In this work, building on the development of the diffusion transformer (DiT), we rethink the necessity of an additional reference network or image encoder and introduce MC-VTON, which leverages DiT's intrinsic backbone to seamlessly integrate minimal conditional try-on inputs. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder. We also remove unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map. We require only the masked person image and the garment image. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply distillation diffusion to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, trainable parameters, and inference steps than baseline methods.
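
The parameter figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below only uses the four numbers stated above (39.7M, 0.33%, 86.8M, 0.72%); the function and variable names are illustrative, and attributing the difference between the two totals to the distillation stage is an inference, not something the abstract states.

```python
# Figures quoted in the abstract: added parameters (millions) and
# their stated share of the backbone (percent).
STAGE1_ADDED_M = 39.7
STAGE1_PCT = 0.33
TOTAL_ADDED_M = 86.8
TOTAL_PCT = 0.72


def implied_backbone_billions(added_millions: float, pct_of_backbone: float) -> float:
    """Backbone size (in billions) implied by 'added = pct% of backbone'."""
    backbone_millions = added_millions / (pct_of_backbone / 100.0)
    return backbone_millions / 1000.0


# Both pairs of numbers should imply roughly the same backbone size.
backbone_from_stage1 = implied_backbone_billions(STAGE1_ADDED_M, STAGE1_PCT)
backbone_from_total = implied_backbone_billions(TOTAL_ADDED_M, TOTAL_PCT)

# Assumption: the gap between the two totals is what the distillation
# stage adds on top of the try-on stage.
stage2_added_m = TOTAL_ADDED_M - STAGE1_ADDED_M

print(f"backbone implied by stage-1 figures: {backbone_from_stage1:.2f}B")
print(f"backbone implied by total figures:   {backbone_from_total:.2f}B")
print(f"parameters added between the two:    {stage2_added_m:.1f}M")
```

Both pairs imply a backbone of roughly 12 billion parameters, so the two percentages are mutually consistent up to rounding.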

Paper Structure

This paper contains 24 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: MC-VTON achieves superior performance on virtual try-on tasks with fine-grained detail preservation, a simplified network and inputs, fewer trainable parameters, and fewer inference steps.
  • Figure 2: Efficiency comparison with other try-on methods. Each method is represented by two concentric circles. The outer circle denotes the total parameters and the inner circle denotes the trainable parameters. MC-VTON achieves lower FID on the VITON-HD dataset with fewer trainable parameters and inference steps.
  • Figure 3: The architecture of our proposed MC-VTON. Our model is trained in two stages. The upper-left LoRA-switch controls the status of three types of LoRA: G-LoRA, MP-LoRA, and Distill-LoRA. Stage-1 trains the try-on ability: G-LoRA and MP-LoRA are trained and Distill-LoRA is disabled. Stage-2 applies distillation diffusion: G-LoRA and MP-LoRA are frozen. The trained MC-VTON acts as the teacher when Distill-LoRA is disabled, and as the student when Distill-LoRA is being trained.
  • Figure 4: Qualitative comparison of our proposed MC-VTON with other methods on VITON-HD dataset. Best viewed when zoomed in.
  • Figure 5: Additional qualitative comparison with IDM-VTON and FitDiT. The upper group is conducted on DressCode dataset and the lower group is the in-the-wild comparison. Best viewed when zoomed in.
  • ...and 1 more figures
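
The LoRA-switch described in the Figure 3 caption can be summarized as a small state table. The following is a minimal sketch of that table only, assuming three states per adapter (disabled, frozen, training); the class and constant names are hypothetical and do not come from the paper or from any library.

```python
from dataclasses import dataclass
from enum import Enum


class LoRAState(Enum):
    DISABLED = "disabled"  # adapter not applied in the forward pass
    FROZEN = "frozen"      # adapter applied, weights not updated
    TRAINING = "training"  # adapter applied and updated


@dataclass(frozen=True)
class LoRASwitch:
    """Hypothetical container for the status of the three LoRA types."""
    g_lora: LoRAState
    mp_lora: LoRAState
    distill_lora: LoRAState

    def trainable(self) -> list:
        # Names of the adapters that receive gradient updates.
        return [name for name, state in vars(self).items()
                if state is LoRAState.TRAINING]


# Stage-1: train the try-on ability.
STAGE1 = LoRASwitch(LoRAState.TRAINING, LoRAState.TRAINING, LoRAState.DISABLED)

# Stage-2: same frozen try-on adapters, used in two roles.
STAGE2_TEACHER = LoRASwitch(LoRAState.FROZEN, LoRAState.FROZEN, LoRAState.DISABLED)
STAGE2_STUDENT = LoRASwitch(LoRAState.FROZEN, LoRAState.FROZEN, LoRAState.TRAINING)

for label, switch in [("stage-1", STAGE1),
                      ("stage-2 teacher", STAGE2_TEACHER),
                      ("stage-2 student", STAGE2_STUDENT)]:
    print(f"{label}: trainable = {switch.trainable()}")
```

Read this way, the teacher and the student share the same backbone and try-on adapters, and only the status of Distill-LoRA distinguishes them.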