VTON-IT: Virtual Try-On using Image Translation

Santosh Adhikari, Bishnu Bhusal, Prashant Ghimire, Anil Shrestha

TL;DR

VTON-IT addresses realistic 2D virtual try-on under varying poses and occlusions by integrating a detection-segmentation-translation pipeline. It combines YOLOv5-based person detection, $U^2$-Net body-part segmentation, and a pix2pix-like image-translation network with a multi-scale discriminator and perceptual/feature-matching losses to wrap clothing onto the segmented region. Compared to CP-VTON+, VTON-IT achieves higher fidelity and texture realism on high-resolution inputs, validated by qualitative comparisons, quantitative metrics, and a user study. The approach demonstrates robustness to outdoor conditions and multi-person scenarios, with future work proposed to expand garment types and enable cross-domain synthesis with controllable parameters.

Abstract

Virtual Try-On (trying clothes virtually) is a promising application of the Generative Adversarial Network (GAN). Transferring the desired clothing item onto the corresponding regions of a human body is difficult, however, because of varying body size, pose, and occlusions such as hair and overlapping clothes. In this paper, we try to produce photo-realistic translated images through semantic segmentation and a generative adversarial architecture-based image translation network. We present VTON-IT, a novel image-based Virtual Try-On application that takes an RGB image, segments the desired body part, and overlays the target cloth on the segmented body region. Most state-of-the-art GAN-based Virtual Try-On applications produce misaligned, pixelated synthesized images on real-life test images. Our approach, by contrast, generates high-resolution, natural-looking images with detailed textures on such varied images.

Paper Structure (22 sections, 4 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Proposed VTON-IT overview. First, the human body is detected and cropped. Then, the desired body region is segmented through the $U^2$-Net architecture, and the segmented mask is fed to the image translation network to generate the wrapped cloth. Finally, the wrapped cloth is overlaid on the input image.
  • Figure 2: Virtual Try-On Architecture: An input image is first fed to the YOLOv5 object detection model to detect the human body, which is then cropped. The cropped image is then passed through the $U^2$-Net segmentation model to generate a body region mask. Finally, the mask is fed into the Pix2Pix generator, which synthesizes RGB-based clothing onto the masked body region, resulting in a virtual try-on of the clothing. The output image shows the synthesized clothing on the original human body image.
  • Figure 3: Example training image for human segmentation and image translation network
  • Figure 4: Visualized comparison with CP-VTON+
  • Figure 5: Result on outdoor images
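
The data flow described in the captions of Figures 1 and 2 (detect and crop, segment, translate the mask into cloth, overlay) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three model stages are passed in as plain callables, and `fake_detect`, `fake_segment`, and `fake_translate` are hypothetical stand-ins for YOLOv5, $U^2$-Net, and the Pix2Pix-style generator. Only the cropping and mask-based overlay are actually implemented, and the blending rule used here is an assumption.

```python
import numpy as np


def crop_to_box(image, box):
    """Crop an H x W x 3 image to an (x0, y0, x1, y1) box clamped to the image."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return image[y0:y1, x0:x1], (x0, y0, x1, y1)


def overlay_cloth(image, cloth, mask, box):
    """Blend `cloth` into `image` inside `box`, weighted by `mask` in [0, 1]."""
    x0, y0, x1, y1 = box
    out = image.astype(np.float32).copy()
    alpha = mask.astype(np.float32)[..., None]  # shape: crop_h x crop_w x 1
    region = out[y0:y1, x0:x1]
    out[y0:y1, x0:x1] = alpha * cloth.astype(np.float32) + (1.0 - alpha) * region
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def try_on(image, detect, segment, translate):
    """Run the four stages: detect and crop, segment, translate, overlay."""
    box = detect(image)                   # stand-in for the person detector
    crop, box = crop_to_box(image, box)
    mask = segment(crop)                  # stand-in for body-region segmentation
    cloth = translate(mask)               # stand-in for the mask-to-cloth generator
    return overlay_cloth(image, cloth, mask, box)


# Hypothetical stand-ins so the sketch runs without any trained model.
def fake_detect(image):
    return (2, 1, 6, 7)  # fixed (x0, y0, x1, y1) box


def fake_segment(crop):
    mask = np.zeros(crop.shape[:2], dtype=np.float32)
    mask[1:-1, 1:-1] = 1.0  # mark the interior of the crop as the body region
    return mask


def fake_translate(mask):
    return np.full(mask.shape + (3,), 200, dtype=np.uint8)  # flat gray "cloth"


image = np.zeros((8, 8, 3), dtype=np.uint8)
result = try_on(image, fake_detect, fake_segment, fake_translate)
```

Because the overlay is confined to the detected box and weighted by the mask, pixels outside the segmented region are left unchanged.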