TryOffAnyone: Tiled Cloth Generation from a Dressed Person

Ioannis Xarchakos, Theodoros Koukopoulos

TL;DR

This work tackles the challenge of generating high-fidelity tiled garment lay-down images from photos of garments worn on models, enabling better online shopping experiences and virtual try-ons. The authors propose TryOffAnyone, a mask-guided latent diffusion framework built on a fine-tuned Stable Diffusion model, employing a single-stage denoising U-Net and training only transformer blocks to drastically reduce parameters to 267.24M while achieving state-of-the-art results on VITON-HD. Key contributions include mask-based conditioning instead of text prompts, a streamlined input-latent pipeline, and extensive ablations demonstrating the effectiveness of transformer-block fine-tuning and seed sensitivity. The method demonstrates strong quantitative and qualitative performance on VITON-HD and FBB, suggesting practical impact for scalable e-commerce applications and future improvements in texture refinement and analysis of seed effects.

Abstract

The fashion industry is increasingly leveraging computer vision and deep learning technologies to enhance online shopping experiences and operational efficiencies. In this paper, we address the challenge of generating high-fidelity tiled garment images essential for personalized recommendations, outfit composition, and virtual try-on systems from photos of garments worn by models. Inspired by the success of Latent Diffusion Models (LDMs) in image-to-image translation, we propose a novel approach utilizing a fine-tuned Stable Diffusion model. Our method features a streamlined single-stage network design, which integrates garment-specific masks to isolate and process target clothing items effectively. By simplifying the network architecture through selective training of transformer blocks and removing unnecessary cross-attention layers, we significantly reduce computational complexity while achieving state-of-the-art performance on benchmark datasets like VITON-HD. Experimental results demonstrate the effectiveness of our approach in producing high-quality tiled garment images for both full-body and half-body inputs. Code and model are available at: https://github.com/ixarchakos/try-off-anyone
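
The selective training of transformer blocks mentioned in the abstract can be pictured as a filter over parameter names: only parameters that belong to a transformer block stay trainable, and everything else is frozen. The sketch below is a minimal, dependency-free illustration of that idea and is not taken from the authors' repository. The parameter names, the sizes, the `transformer_blocks` keyword, and the `select_trainable` helper are all hypothetical.

```python
from typing import Dict, List


def select_trainable(param_counts: Dict[str, int],
                     keyword: str = "transformer_blocks") -> List[str]:
    """Return the names of parameters whose name contains `keyword`."""
    return [name for name in param_counts if keyword in name]


# Hypothetical parameter names and sizes, invented for illustration only.
param_counts = {
    "down_blocks.0.resnets.0.conv1.weight": 1000,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight": 400,
    "mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj.weight": 600,
    "up_blocks.1.resnets.0.conv2.weight": 2000,
}

# Parameters selected for training; all others would be frozen.
trainable = select_trainable(param_counts)
trainable_total = sum(param_counts[name] for name in trainable)
frozen_total = sum(param_counts.values()) - trainable_total

print(f"trainable: {trainable_total}, frozen: {frozen_total}")
```

In a deep learning framework the same filter would typically decide which parameters receive gradient updates; the string match here only stands in for whatever selection rule the actual implementation uses.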

Paper Structure

This paper contains 23 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Different garment image views
  • Figure 2: Cloth masking
  • Figure 3: TryOffAnyone Network Architecture
  • Figure 4: Qualitative comparison against TryOffDiff on VITON-HD
  • Figure 5: Qualitative results on FBB dataset
  • ...and 1 more figure