Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
Seungyong Lee, Jeong-gi Kwak
TL;DR
Voost addresses the challenge of accurate garment–body alignment by unifying virtual try-on and try-off in a single diffusion transformer that learns both directions jointly. It concatenates inputs at the token level under a shared conditioning layout, fine-tunes only the attention modules to preserve diffusion priors, and adds two inference-time refinements: attention temperature scaling and self-corrective sampling. In extensive experiments on DressCode, VITON-HD, and in-the-wild images, Voost achieves state-of-the-art results on both try-on and try-off, outperforming task-specific baselines in alignment, fidelity, and generalization. The approach requires no separate models or auxiliary losses, enabling scalable, bidirectional garment–person reasoning with practical inference-time improvements and strong user-perceived realism.
Abstract
Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost lets each garment-person pair supervise both directions and supports flexible conditioning on generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling, for robustness to variation in resolution or mask, and self-corrective sampling, which leverages bidirectional consistency between the two tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
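To make the idea of attention temperature scaling concrete, the following is a minimal NumPy sketch of scaled dot-product attention with an extra temperature factor applied to the logits at inference time. It is an illustration under stated assumptions, not the formulation used in Voost: the abstract does not give the scaling rule, so the function `length_temperature` (a log-ratio of inference to training token counts) and all names and shapes here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, temperature=1.0):
    """Scaled dot-product attention with an extra logit multiplier.

    temperature > 1 sharpens the attention distribution,
    temperature < 1 flattens it, and 1.0 is standard attention.
    """
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)
    weights = softmax(temperature * logits, axis=-1)
    return weights @ v, weights


def length_temperature(n_tokens, n_train_tokens):
    """ASSUMPTION: a log-ratio heuristic for picking the temperature when
    the token count at inference differs from training. This is a
    hypothetical stand-in, not the rule from the paper."""
    return np.log(n_tokens) / np.log(n_train_tokens)


def entropy(w):
    """Row-wise entropy of attention weights."""
    return -(w * np.log(w + 1e-12)).sum(axis=-1)


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))   # 4 query tokens, head dimension 16
k = rng.normal(size=(64, 16))  # 64 key tokens (more than "training")
v = rng.normal(size=(64, 16))

tau = length_temperature(n_tokens=64, n_train_tokens=16)
out_base, w_base = attention(q, k, v, temperature=1.0)
out_sharp, w_sharp = attention(q, k, v, temperature=tau)
```

With more key tokens than the model saw in training, unscaled softmax attention tends to spread its mass more thinly; multiplying the logits by a factor above 1 counteracts that by lowering the entropy of each attention row. How Voost actually chooses the factor for a given resolution or mask is not stated in the abstract.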
