
M&M VTO: Multi-Garment Virtual Try-On and Editing

Luyang Zhu, Yingwei Li, Nan Liu, Hao Peng, Dawei Yang, Ira Kemelmacher-Shlizerman

TL;DR

M&M VTO addresses multi-garment virtual try-on with a unified, single-stage diffusion framework that directly synthesizes $1024\times512$ imagery. It introduces VTO-UDiT to disentangle person identity from denoising and employs a space-efficient finetuning strategy that updates only person features, reducing the per-subject parameters to $\sim$6 MB. Layout control is achieved via PaLI-3–based text attributes, enabling language-guided editing of garment arrangements. Empirical results show state-of-the-art performance in quality, identity preservation, and layout editing, with strong user preference and robust cross-garment synthesis, expanding practical VTO applications.

Abstract

We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person. An example input includes an image of a shirt, an image of a pair of pants, the text "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments (in the desired layout) would look on the given person. The key contributions of our method are: 1) a single-stage diffusion-based model, with no super-resolution cascading, that allows mixing and matching multiple garments at 1024x512 resolution while preserving and warping intricate garment details; 2) an architecture design (VTO UNet Diffusion Transformer) that disentangles denoising from person-specific features, allowing for a highly effective finetuning strategy for identity preservation (a 6MB model per individual vs. 4GB with, e.g., DreamBooth finetuning) and solving a common identity-loss problem in current virtual try-on methods; and 3) layout control for multiple garments via text inputs, with PaLI-3 finetuned specifically for the virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, and opens up new opportunities for virtual try-on via language-guided and multi-garment try-on.
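
The identity-preserving finetuning idea in contribution 2 (keep the network weights fixed and optimize only the per-person features) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the linear `decode` function stands in for the frozen network, and the squared-error loss, learning rate, and step count are hypothetical choices, not the paper's training setup.

```python
def decode(weights, features):
    """Frozen 'network': a plain linear map standing in for the real model."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def finetune_features(weights, features, target, lr=0.5, steps=200):
    """Gradient descent on the features only; `weights` is never modified."""
    features = list(features)
    for _ in range(steps):
        residual = [o - t for o, t in zip(decode(weights, features), target)]
        # Gradient of 0.5 * ||W f - target||^2 with respect to f is W^T residual.
        grad = [
            sum(weights[r][c] * residual[r] for r in range(len(weights)))
            for c in range(len(features))
        ]
        features = [f - lr * g for f, g in zip(features, grad)]
    return features


# Hypothetical toy data: shared frozen weights and one person's target output.
frozen_weights = [[1.0, 0.5], [0.0, 1.0]]
person_target = decode(frozen_weights, [2.0, -1.0])
person_features = finetune_features(frozen_weights, [0.0, 0.0], person_target)
```

In this toy setting only `person_features` would need to be stored per person, while `frozen_weights` is shared, which mirrors why the abstract reports a small per-individual footprint compared with finetuning the whole model.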

Paper Structure (20 sections, 1 equation, 29 figures, 6 tables)

This paper contains 20 sections, 1 equation, 29 figures, 6 tables.

Figures (29)

  • Figure 1: Given an input person image and multiple garments, M&M VTO can output a virtual try-on visualization of how those garments would look on the person. Our model performs well across various body shapes, poses, and garments. In addition, it allows the layout to be changed, e.g., "roll up the sleeves" (top rightmost column), and "tuck in the shirt and roll down the sleeves" (bottom rightmost column).
  • Figure 2: Overview of M&M VTO. Left: Given multiple garments (top and bottom in this case, full-body garment not shown for this example), layout description, and a person image, our method enables multi-garment virtual try-on. Right: By freezing all the parameters, we optimize person feature embeddings extracted from the person encoder to improve person identity for a specific input image. The fine-tuning process recovers the information lost via agnostic computation.
  • Figure 3: VTO-UDiT architecture. For image inputs, UNet encoders ($\mathbf{E}_{\mathbf{z}_t}$, $\mathbf{E}_{p}$, $\mathbf{E}_{g}$) extract feature maps ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$, $\mathcal{F}_{g}^{\kappa}$) from $\mathbf{z}_t$, $I_{a}$, $I_{c}^{\kappa}$, respectively, with $\kappa \in \{\text{upper}, \text{lower}, \text{full}\}$. Diffusion timestep $t$ and garment attributes $y_{\text{gl}}$ are embedded with sinusoidal positional encoding, followed by a linear layer. The embeddings ($\mathcal{F}_{t}$ and $\mathcal{F}_{y_{\text{gl}}}$) are then used to modulate features with FiLM dumoulin2018feature or concatenated to the key-value feature of self-attention in DiT, similar to saharia2022photorealistic. Following Zhu_2023_CVPR_tryondiffusion, spatially aligned features ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$) are concatenated, whereas $\mathcal{F}_{g}^{\kappa}$ are implicitly warped with cross-attention blocks. The final denoised image $\hat{\mathbf{x}}_0$ is obtained with decoder $\mathbf{D}_{\mathbf{z}_t}$, which is architecturally symmetrical to $\mathbf{E}_{\mathbf{z}_t}$.
  • Figure 4: Qualitative Comparison with existing Try-On methods. On the left, we compare with TryOnDiffusion Zhu_2023_CVPR_tryondiffusion on our test set and further evaluate on DressCode morelli2022dress dataset, as shown on the right. Our method can generate better garment details and layouts.
  • Figure 5: Qualitative Comparison for Garment Layout Editing. Top: editing instruction is to "tuck out the shirt". Bottom: "roll down the sleeve". Our method enables more accurate layout editing while preserving the details from the inputs. Details are provided in the Supplementary.
  • ...and 24 more figures
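
The conditioning path described in the Figure 3 caption (sinusoidal encoding, then a linear layer, then FiLM modulation of feature maps) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names, the embedding size, the base frequency of 10000, the random weights, and the `gamma * x + beta` form of FiLM are all choices made here for the example.

```python
import math
import random


def sinusoidal_embedding(value, dim):
    """Encode a scalar (e.g. a timestep) as `dim` sine/cosine values; `dim` must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def linear(x, weight, bias):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, gamma, beta):
    """FiLM: scale and shift every value of channel c by gamma[c] and beta[c]."""
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


random.seed(0)
channels, emb_dim = 4, 8  # hypothetical sizes
emb = sinusoidal_embedding(250, emb_dim)
# The linear layer maps the embedding to one (gamma, beta) pair per channel.
weight = [[random.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(2 * channels)]
bias = [1.0] * channels + [0.0] * channels
params = linear(emb, weight, bias)
gamma, beta = params[:channels], params[channels:]
feature_map = [[1.0, 2.0, 3.0] for _ in range(channels)]  # 4 channels, 3 positions each
modulated = film(feature_map, gamma, beta)
```

The same embedding could instead be appended to the key-value sequence of an attention layer, which is the alternative the caption mentions for the DiT blocks; that variant is not shown here.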