Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories

Junyao Hu, Zhongwei Cheng, Waikeung Wong, Xingxing Zou

Abstract

Virtual try-on (VTON) has advanced single-garment visualization, yet real-world fashion centers on full outfits with multiple garments, accessories, fine-grained categories, layering, and diverse styling, all of which remain beyond current VTON systems. Existing datasets are category-limited and lack outfit diversity. We introduce Garments2Look, the first large-scale multimodal dataset for outfit-level VTON, comprising 80K many-garments-to-one-look pairs across 40 major categories and 300+ fine-grained subcategories. Each pair includes an outfit with 3-12 reference garment images (4.48 on average), a model image wearing the outfit, and detailed item and try-on textual annotations. To balance authenticity and diversity, we propose a synthesis pipeline that heuristically constructs outfit lists before generating try-on results, with the entire process subjected to strict automated filtering and human validation to ensure data quality. To probe task difficulty, we adapt SOTA VTON methods and general-purpose image editing models to establish baselines. Results show that current methods struggle to try on complete outfits seamlessly and to infer correct layering and styling, leading to misalignment and artifacts.

Paper Structure (29 sections, 20 figures, 8 tables)

Figures (20)

  • Figure 1: Comparison of data formats in virtual try-on datasets. The outfit-level dataset is collected and generated from a large-scale set of real images, each paired with diverse clothing and accessories, and includes outfit layering and styling information.
  • Figure 2: Overview of Garments2Look construction process.
  • Figure 3: Data distribution statistics for Garments2Look.
  • Figure 4: Comparison of results from 3 SOTA VTON models and 4 general-purpose image editing models on 4 representative real examples from the Garments2Look test set. QIE-2509 = Qwen-Image-Edit-2509, NB = Nano Banana, N Ref = using a model image and multiple single-garment images as input, 2 Ref = using a model image and an OOTD image as input. A yellow box marks a difference from the look image (GT). A black arrow indicates a distinct artifact boundary. Row 1: 4 items, 1 layer, no accessory. Row 2: 5 items, 2 layers, no accessory. Row 3: 8 items, 3 layers, 2 accessories. Row 4: 9 items, 3 layers, 3 accessories.
  • Figure 5: Garment consistency with respect to the number of reference garment images, grouped by model type.
  • ...and 15 more figures