MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation

Xujie Zhang, Ente Lin, Xiu Li, Yuxuan Luo, Michael Kampffmeyer, Xin Dong, Xiaodan Liang

TL;DR

MMTryon tackles multi-item, style-controllable virtual try-on without relying on segmentation. It introduces a diffusion-based framework conditioned on a source image, multiple garment references, and text instructions, powered by a pretrained garment encoder and two specialized attention modules. A scalable data generation pipeline provides multi-reference, multi-modal training data, enabling parsing-free inference. Across high-resolution benchmarks and in-the-wild scenes, MMTryon achieves state-of-the-art results and demonstrates robust transfer to diverse outfits and sources.

Abstract

This paper introduces MMTryon, a multi-modal multi-reference VIrtual Try-ON (VITON) framework, which can generate high-quality compositional try-on results by taking a text instruction and multiple garment images as inputs. Our MMTryon addresses three problems overlooked in prior literature: 1) Support of multiple try-on items. Existing methods are commonly designed for single-item try-on tasks (e.g., upper/lower garments, dresses). 2) Specification of dressing style. Existing methods are unable to customize dressing styles based on instructions (e.g., zipped/unzipped, tuck-in/tuck-out). 3) Segmentation dependency. Existing methods also rely heavily on category-specific segmentation models to identify the replacement regions, so segmentation errors lead directly to significant artifacts in the try-on results. To address the first two issues, our MMTryon introduces a novel multi-modality and multi-reference attention mechanism to combine the garment information from reference images and dressing-style information from text instructions. In addition, to remove the segmentation dependency, MMTryon uses a parsing-free garment encoder and leverages a novel scalable data generation pipeline to convert existing VITON datasets to a form that allows MMTryon to be trained without requiring any explicit segmentation. Extensive experiments on high-resolution benchmarks and in-the-wild test sets demonstrate MMTryon's superiority over existing SOTA methods both qualitatively and quantitatively. MMTryon's impressive performance on multi-item and style-controllable virtual try-on scenarios and its ability to try on any outfit in a large variety of scenarios from any source image open up a new avenue for future investigation in the fashion community.

Paper Structure (23 sections, 5 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: MMTryon can follow complex instructions to generate high-quality try-on results.
  • Figure 2: Overview of the proposed MMTryon framework. The instruction prompt and garment images are combined to obtain a multi-modal instruct embedding, replacing the original textual condition. Each garment image, together with the corresponding text span, is further processed by the garment encoder to obtain reference features, which, along with target features, undergo multi-reference attention to ensure detailed texture transfer.
  • Figure 3: Overview of the proposed pretrained garment encoder. Our garment encoder utilizes a prior mask derived from Grounding DINO and SAM to improve text query accuracy through cross-attention between the target text and the input features. The garment encoder is supervised by the diffusion reconstruction loss and our text query loss.
  • Figure 4: The data generation pipeline of MMTryon. We use a large multi-modal model to describe the target person image, followed by open-vocabulary grounding and segmentation models to extract correspondences between a person image and several garment subjects. For each subject, we utilize SDXL inpainting to obtain the enhanced dataset, which serves as our training data.
  • Figure 5: Qualitative comparisons on VITON-HD in the single try-on task. Compared with other methods, our method MMTryon produces more realistic and texture-consistent images.
  • ...and 7 more figures
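
The Figure 2 caption says that reference features and target features "undergo multi-reference attention". The paper's exact formulation is not given in this summary, so the following is only a minimal sketch of one common way such a mechanism can be built: the target tokens act as queries, and the keys and values come from the target tokens concatenated with the tokens of every garment reference. The function name `multi_reference_attention`, the single-head layout, the projection matrices, and all shapes are assumptions made for illustration, not details taken from MMTryon.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_reference_attention(target, references, w_q, w_k, w_v):
    """Hypothetical single-head attention over several garment references.

    target:     (n_target, d) features of the image being generated
    references: list of (n_i, d) feature arrays, one per garment image
    w_q/w_k/w_v: (d, d) projection matrices (assumed, randomly set below)
    """
    # Keys/values cover the target itself plus every reference, so each
    # target token can pull texture from whichever garment is relevant.
    context = np.concatenate([target, *references], axis=0)
    q = target @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
target = rng.normal(size=(4, d))                           # 4 target tokens
refs = [rng.normal(size=(3, d)), rng.normal(size=(5, d))]  # 2 garments
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = multi_reference_attention(target, refs, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Concatenating all references into one key/value sequence lets the number of garments vary at inference time without changing any weights, which is one plausible reason to prefer this layout for multi-item try-on; whether MMTryon does exactly this is not stated here.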