MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

TL;DR

MV-VTON extends virtual try-on beyond the frontal view: given frontal and back views of a garment, it synthesizes the dressed person from multiple views using a diffusion-based inpainting backbone conditioned on both clothing images. A view-adaptive selection mechanism produces global features $c_g$ and multi-scale local features $c_l^i$, and joint attention blocks align and fuse the clothing features with the target person's features to preserve high-frequency garment details. The method achieves state-of-the-art results for multi-view try-on on the MVG dataset and competitive results on the frontal-view benchmarks VITON-HD and DressCode. The paper also contributes MVG, a dataset in which each person is photographed in multiple views and poses.

Abstract

The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, existing methods focus solely on frontal try-on using frontal clothing. When the views of the clothing and the person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results from multiple views using the given clothes. Given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to cover the complete view as much as possible. Moreover, we adopt diffusion models, which have demonstrated superior generative abilities, to perform MV-VTON. In particular, we propose a view-adaptive selection method in which hard-selection and soft-selection are applied to global and local clothing feature extraction, respectively. This ensures that the clothing features roughly fit the person's view. Subsequently, we propose joint attention blocks to align and fuse clothing features with person features. Additionally, we collect an MV-VTON dataset, MVG, in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on the MV-VTON task using our MVG dataset, but also outperforms existing methods on the frontal-view virtual try-on task using the VITON-HD and DressCode datasets.
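
The abstract states only that hard-selection is applied to global clothing features; it does not say how the choice is made. The sketch below illustrates one plausible reading, in which exactly one of the two views' global features is kept depending on the person's orientation. The function name `hard_select`, the yaw-angle input, and the 90-degree threshold are all assumptions made for illustration and are not taken from the paper.

```python
def hard_select(front_feat, back_feat, person_yaw_deg):
    """Keep the global feature of whichever clothing view is closer to the person's view.

    Assumption (not from the paper): the person's orientation is summarized by a
    yaw angle in degrees, where 0 means facing the camera and 180 means facing away.
    """
    # Fold any angle into [0, 180] so that, e.g., -170 and 190 both map to 170.
    yaw = abs((person_yaw_deg + 180.0) % 360.0 - 180.0)
    # Hard selection: pick exactly one view, with no blending between the two.
    return front_feat if yaw <= 90.0 else back_feat


# Placeholder strings stand in for feature vectors (e.g. image-encoder embeddings).
print(hard_select("front_global", "back_global", 30.0))    # front_global
print(hard_select("front_global", "back_global", -170.0))  # back_global
```

A binary choice of this kind discards the other view entirely, which is why it is only a coarse fit to the person's view.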

Paper Structure

This paper contains 20 sections, 8 equations, 15 figures, and 4 tables.

Figures (15)

  • Figure 1: Motivation of this work. Previous VTON methods, e.g., StableVITON (Kim et al., 2023), can only be used for a frontal-view person and fail when the person appears in other views. Our MV-VTON can faithfully present the try-on results for a person in various views.
  • Figure 2: Comparison between previous datasets and our proposed MVG dataset. (a) shows a dataset used by previous work, which contains only frontal-view clothing and persons. In contrast, our dataset (b) offers images from five different views.
  • Figure 3: (a) Overview of MV-VTON. It encodes the frontal-view and back-view clothing into global features using the CLIP image encoder and extracts multi-scale local features through an additional encoder $\mathcal{E}_l$. Both kinds of features act as conditional inputs for the decoder of the backbone, and both are selectively extracted through the view-adaptive selection mechanism. (b) Soft-selection modulates the frontal-view and back-view clothing features separately, based on the similarity between the clothing's pose and the person's pose. The features from both views are then concatenated along the channel dimension.
  • Figure 4: Overview of the proposed joint attention blocks.
  • Figure 5: Qualitative comparisons on the multi-view virtual try-on task with the MVG dataset.
  • ...and 10 more figures
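
The soft-selection step described in the caption of Figure 3(b) can be sketched roughly as follows: each view's local clothing features are scaled by how similar that view's clothing pose is to the person's pose, and the two scaled feature maps are then concatenated along the channel dimension. This is a simplified illustration, not the paper's implementation. It assumes poses are given as plain embedding vectors, uses cosine similarity as the similarity measure, and applies a single scalar weight per view; the function names and data layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def soft_select(front_feat, back_feat, front_pose, back_pose, person_pose):
    """Modulate each view's features by pose similarity, then concatenate channels.

    Features are lists of channels; each channel is a flat list of floats.
    Assumption: one scalar weight per view. The actual method may use
    spatially varying weights.
    """
    w_front = cosine_similarity(front_pose, person_pose)
    w_back = cosine_similarity(back_pose, person_pose)
    scaled_front = [[w_front * v for v in channel] for channel in front_feat]
    scaled_back = [[w_back * v for v in channel] for channel in back_feat]
    # Channel-wise concatenation: the output has C_front + C_back channels.
    return scaled_front + scaled_back


# Toy example: the person's pose matches the frontal clothing pose exactly.
fused = soft_select(
    front_feat=[[1.0, 2.0]],
    back_feat=[[3.0, 4.0]],
    front_pose=[1.0, 0.0],
    back_pose=[0.0, 1.0],
    person_pose=[1.0, 0.0],
)
print(fused)  # [[1.0, 2.0], [0.0, 0.0]]
```

Unlike a hard choice between views, this weighting keeps both views in the output, so a side-facing person can draw on frontal and back garment details at once.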