Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Lingxiao Lu, Shengyi Wu, Haoxuan Sun, Junhong Gou, Jianlou Si, Chen Qian, Jianfu Zhang, Liqing Zhang

TL;DR

This research introduces an approach to virtual clothes try-on that couples a self-supervised Vision Transformer (ViT) with a diffusion model, enabling the diffusion model to reproduce clothing details with increased clarity and realism.

Abstract

Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.
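
The idea of contrasting local clothing embeddings with their global counterparts can be illustrated with a small sketch. The abstract does not state the exact objective, so the InfoNCE-style loss, the function names, and the `temperature` parameter below are assumptions made for illustration, not the authors' formulation. Each row of `local_emb` is assumed to come from a crop of the image whose full-view embedding sits in the same row of `global_emb`.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def local_global_contrastive_loss(local_emb, global_emb, temperature=0.1):
    """Hypothetical InfoNCE-style loss between local and global embeddings.

    local_emb, global_emb: arrays of shape (N, D); row i of each is assumed
    to describe the same clothing image (local crop vs. full view).
    """
    local_emb = l2_normalize(local_emb)
    global_emb = l2_normalize(global_emb)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = local_emb @ global_emb.T / temperature          # (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching global embedding for local crop i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy data: local embeddings are noisy copies of their global counterparts.
rng = np.random.default_rng(0)
global_emb = rng.normal(size=(4, 8))
local_emb = global_emb + 0.01 * rng.normal(size=(4, 8))

matched = local_global_contrastive_loss(local_emb, global_emb)
mismatched = local_global_contrastive_loss(np.roll(local_emb, 1, axis=0), global_emb)
```

In this toy setup the loss is lower when each local embedding is paired with its own global embedding than when the pairing is shifted, which is the behavior a local-to-global contrast is meant to reward.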

Paper Structure

This paper contains 18 sections, 3 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Overall framework of our network. We utilize the Stable Diffusion (SD) Inpainting network and employ a specially finetuned Vision Transformer (ViT) to direct the network's focus towards intricate details of the clothes image. The finetuned ViT, denoted as $\tau$, also serves as a feature extractor for calculating the loss and further refining the inpainting process. In addition, we integrate warp features into the input to improve the alignment between the network's internal features and those in the given condition. For simplicity, we omit the encoder $E$ and the decoder $D$ of the SD network from the depiction.
  • Figure 2: Visualization of the Average Head's Attention for the Class Token in ViT. "SS-" represents the scenario without any finetuning, "SS+RF" indicates the use of random local crops for self-supervised finetuning, and "SS+SF" signifies the application of our method, which selectively chooses local crops for self-supervised finetuning.
  • Figure 3: In this visualization, (a) displays the original image input to the condition encoder $\tau$. Subfigure (b) illustrates the attention maps of two specific heads within the self-attention mechanism of ViT, highlighting areas of focus. Subfigure (c) shows the focal points derived from the attention maps presented in (b), pinpointing the specific areas receiving the highest attention. The aggregation of focal points across all heads is depicted in (d), demonstrating the comprehensive attention landscape. Based on the focal points in (d), clustering is conducted to identify key cluster centers, which are prominently marked in red in subfigure (e), indicating areas of significant attention across all heads.
  • Figure 4: Qualitative comparisons.
  • Figure 5: Visualization of Limitations of our method.
  • ...and 3 more figures
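
The pipeline described in the Figure 3 caption (per-head attention maps, focal points, aggregation across heads, clustering into key centers) can be sketched roughly as follows. The caption does not say how focal points are picked or which clustering algorithm is used, so the top-k peak selection, the plain k-means with farthest-point initialization, and all function and parameter names here are assumptions for illustration only.

```python
import numpy as np

def focal_points(attn, top_k=1):
    """Return the top_k highest-attention locations of every head.

    attn: array of shape (heads, H, W) holding one attention map per head.
    Output: float array of shape (heads * top_k, 2) with (row, col) pairs.
    """
    heads, height, width = attn.shape
    flat = attn.reshape(heads, -1)
    idx = np.argsort(flat, axis=1)[:, -top_k:]               # (heads, top_k)
    rows, cols = np.unravel_index(idx.ravel(), (height, width))
    return np.stack([rows, cols], axis=1).astype(float)

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.linalg.norm(points[:, None] - np.array(centers)[None], axis=2).min(axis=1)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:                             # keep empty clusters in place
                centers[j] = members.mean(axis=0)
    labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels

# Toy attention maps: three heads peak near the top-left, three near the bottom-right.
attn = np.zeros((6, 8, 8))
for head, (r, c) in enumerate([(1, 1), (1, 2), (2, 1), (6, 6), (6, 5), (5, 6)]):
    attn[head, r, c] = 1.0

points = focal_points(attn)                                  # aggregated over all heads
centers, labels = kmeans(points, k=2)
centers = centers[np.argsort(centers[:, 0])]                 # sort for stable display
```

On this toy input the two cluster centers land at the mean of each group of peaks, which corresponds to the red-marked key centers in subfigure (e).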