Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Seungyong Lee, Jeong-gi Kwak

TL;DR

Voost addresses the challenge of accurate garment–body alignment in virtual try-on by unifying try-on and try-off in a single diffusion-transformer framework that jointly learns both directions. It introduces token-level concatenation with a shared conditioning layout and two inference-time refinements (attention-temperature scaling and self-corrective sampling), while fine-tuning only the attention modules to preserve diffusion priors. In experiments on DressCode, VITON-HD, and in-the-wild images, Voost achieves state-of-the-art results on both try-on and try-off, outperforming task-specific baselines in alignment, fidelity, and generalization. The approach removes the need for separate models or auxiliary losses, enabling scalable, bidirectional garment–person reasoning with practical inference-time improvements and strong user-perceived realism.

Abstract

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
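
As a rough illustration of what attention temperature scaling means in general, the sketch below divides raw attention scores by a temperature before the softmax. It is not the paper's formulation: how Voost chooses the temperature from resolution or mask size is not stated in this summary, and the function names `softmax_with_temperature` and `attention_entropy`, along with the example scores, are assumptions made for illustration only.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over raw attention scores divided by a temperature.

    Illustrative only: the rule Voost uses to pick the temperature
    is not given in this summary, so it is a plain argument here.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy of an attention distribution (lower = sharper)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Hypothetical query-key scores for one query token.
scores = [2.0, 1.0, 0.5, 0.1]
default = softmax_with_temperature(scores, temperature=1.0)
sharp = softmax_with_temperature(scores, temperature=0.5)
```

A temperature below 1 concentrates the weights on the highest-scoring keys, and a temperature above 1 spreads them out. That is the general lever such a technique adjusts when the proportion of mask to garment tokens changes.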

Paper Structure

This paper contains 39 sections, 5 equations, 21 figures, 3 tables, 1 algorithm.

Figures (21)

  • Figure 1: Attention map comparison: CatVTON [chong2024catvtonconcatenationneedvirtual] shows dispersed attention unrelated to the query point, indicating weak spatial grounding. In contrast, our model produces sharply localized maps that align well with the corresponding garment regions, demonstrating stronger relational understanding.
  • Figure 2: Overview of pipeline. Voost enables bidirectional virtual try-on and try-off with a unified diffusion transformer for scalable learning.
  • Figure 3: Impact of temperature scaling. Adaptive temperature scaling enhances visual detail by adjusting attention behavior under varying spatial proportions of mask and garment regions.
  • Figure 4: Qualitative comparison of try-on results with existing try-on methods [kim2024stableviton, idm-vton, chong2024catvtonconcatenationneedvirtual, leffa]. Best viewed in color and under zoom.
  • Figure 5: Additional qualitative comparison of try-on results with other methods. Best viewed in color and under zoom.
  • ...and 16 more figures