HYB-VITON: A Hybrid Approach to Virtual Try-On Combining Explicit and Implicit Warping
Kosuke Takemoto, Takafumi Koshinaka
TL;DR
The paper addresses the trade-off in image-based virtual try-on between detail-rich explicit warping and realism-focused implicit warping. It introduces HYB-VITON, a hybrid framework that preprocesses explicitly warped garments and injects them into a diffusion-based inpainting pipeline while balancing explicit and implicit cues via per-layer cross-attention scaling. On VITON-HD, HYB-VITON preserves garment details more faithfully than diffusion-based methods and produces more realistic results than a state-of-the-art explicit-warping baseline, as shown by both qualitative and quantitative results. The approach offers a practical path to high-fidelity, realistic virtual try-on by integrating explicitly warped garment regions into an implicit, generative pipeline.
Abstract
Virtual try-on systems have significant potential in e-commerce, allowing customers to visualize garments on themselves. Existing image-based methods fall into two categories: those that directly warp garment images onto person images (explicit warping), and those that use cross-attention to reconstruct given garments (implicit warping). Explicit warping preserves garment details but often produces unrealistic output, while implicit warping achieves natural reconstruction but struggles with fine details. We propose HYB-VITON, a novel approach that combines the advantages of both methods and includes a preprocessing pipeline for warped garments and a novel training option. These components allow us to utilize beneficial regions of explicitly warped garments while leveraging the natural reconstruction of implicit warping. A series of experiments demonstrates that HYB-VITON preserves garment details more faithfully than recent diffusion-based methods, while producing more realistic results than a state-of-the-art explicit warping method.
