MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

Hunor Laczkó, Libang Jia, Loc-Phat Truong, Diego Hernández, Sergio Escalera, Jordi Gonzalez, Meysam Madadi

TL;DR

This paper introduces MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis, and uses it to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis.

Abstract

Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g. rolled sleeves, tucked shirt). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VTON applications, MV-Fashion provides paired data: multi-view synchronized captures of worn garments alongside their corresponding flat, catalogue images. We leverage this dataset to establish baselines for fashion-centric tasks, including virtual try-on, clothing size estimation, and novel view synthesis. The dataset is available at https://hunorlaczko.github.io/MV-Fashion .
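To give a sense of scale, the headline numbers can be turned into rough per-sequence averages. The sketch below uses only the counts quoted in the abstract and the camera counts from the Figure 1 caption. It assumes (this is not stated in the source) that the 72.5 million total counts one frame per camera per time step and that all cameras record every sequence, so the results are order-of-magnitude estimates rather than dataset statistics.

```python
# Figures quoted in the abstract and the Figure 1 caption.
TOTAL_FRAMES = 72_500_000    # "72.5 million frames" (rounded in the source)
NUM_SEQUENCES = 3_273
NUM_SUBJECTS = 80
NUM_CAMERAS = 60 + 8         # 60 RGB cameras + 8 RGB-depth cameras

# ASSUMPTION: the frame total sums over all camera views, and every
# camera records every sequence. Neither is confirmed by the source.
frames_per_sequence = TOTAL_FRAMES / NUM_SEQUENCES
frames_per_view = frames_per_sequence / NUM_CAMERAS
sequences_per_subject = NUM_SEQUENCES / NUM_SUBJECTS

print(f"Average frames per sequence (all views): {frames_per_sequence:,.0f}")
print(f"Average frames per view per sequence:    {frames_per_view:,.0f}")
print(f"Average sequences per subject:           {sequences_per_subject:.1f}")
```

Under these assumptions, an average sequence holds roughly 22 thousand frames in total, or a few hundred time steps per camera, and each subject contributes about 41 sequences spread over their 3-10 outfits.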

Paper Structure (37 sections, 1 equation, 30 figures, 16 tables)

Figures (30)

  • Figure 1: We present MV-Fashion, a multi-view synchronized video dataset with 72.5 million frames. The dataset contains diverse clothing, multi-layered outfits annotated with draping styles, and paired catalogue domain data, ready for virtual try-on and fashion-centric tasks. The capture setup, shown in the middle, features 60 RGB cameras (blue) and 8 RGB-depth cameras (green) with 4K footage.
  • Figure 2: Available data for one subject in a single clothing set. (a) Subject: Frontal view, clothing layers (w & w/o jacket), styles (open & closed jacket) and template recordings, four representative RGB views, and depth images with the reconstructed point cloud. (b) Image Annotations: Foreground segmentation isolating the subject, garment segmentation for each clothing item, fitted SMPL-X body model, and labelled bounding boxes for all garments in each frame. (c) Garments, Materials & Sizing: For each garment, frontal and back views in stretched and normal states, text description of the clothing, sizing chart with corresponding measurements (in centimeters), and properties.
  • Figure 3: Qualitative comparison showing that models trained with styling-augmented data respond to fine-grained styling prompts ("a jacket, the outer wear is fully open"), producing outputs that are sometimes distinguishable from those generated with no-style prompts ("a jacket").
  • Figure 4: Results on (a) Cross-View Geometric Test vs (b) View-Adaptive Try-On. The updated IDM-VTON architecture can map between the catalogue view and the person's pose when both front and rear images of the garment are provided to the model.
  • Figure 5: Qualitative results of the canonical garment normal predictor $\Psi$ for groups G2, G4, G5 and G6, showing the input frame, the ground truth ($G^t$), and the predicted normal image.
  • ...and 25 more figures