One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

TL;DR

The paper tackles the pose-rigid limitation of prior virtual try-on/try-off methods by proposing OMFA, a unified diffusion-based model that handles try-on and try-off in a single framework without templates. It introduces a Bidirectional Tweedie Diffusion in a latent space, guided by LLM-inspired conditional generation, and injects explicit 3D geometry via SMPL-X conditioning to support arbitrary poses and multi-view synthesis from a single image. The approach achieves state-of-the-art results on VITON-HD and DeepFashion-MultiModal across both tasks, with strong qualitative results and comprehensive ablations. The work offers a practical, flexible solution for real-world garment editing and transfer, potentially enabling scalable, user-controlled virtual try-on experiences.

Abstract

Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios - for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by language modeling, where generation is guided by conditioning prompts. However, our framework differs fundamentally from LLMs in two key aspects. First, it employs a bidirectional modeling paradigm that symmetrically allows prompting either from the garment to generate try-on results or from the dressed person to recover the try-off garment. Second, it strictly adheres to Tweedie's formula, enabling faithful estimation of the underlying data distribution during the denoising process. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as input, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical solution for virtual garment synthesis.
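The abstract's reference to Tweedie's formula can be made concrete with a small numerical sketch. The paper's own formulation is not reproduced in this section, so the code below assumes the standard DDPM-style forward process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, under which Tweedie's formula gives the posterior mean E[x_0 | x_t] = (x_t + (1 - alpha_bar) * score) / sqrt(alpha_bar). The function names `tweedie_x0` and `tweedie_x0_from_eps`, and the toy Gaussian prior, are illustrative assumptions and not part of OMFA.

```python
import numpy as np

def tweedie_x0(x_t, score, alpha_bar):
    """Tweedie posterior mean E[x_0 | x_t], given the score of p(x_t).

    Assumes x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps (an assumption,
    not the paper's stated parameterization).
    """
    return (x_t + (1.0 - alpha_bar) * score) / np.sqrt(alpha_bar)

def tweedie_x0_from_eps(x_t, eps_hat, alpha_bar):
    """Same estimate written with a noise prediction eps_hat = -sqrt(1 - alpha_bar) * score."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

# Toy check with a 1-D Gaussian prior x_0 ~ N(mu, s2), where the score is analytic.
mu, s2, alpha_bar = 2.0, 0.25, 0.6
x_t = np.array([0.5, 1.5, 3.0])

# Marginal of x_t is N(sqrt(alpha_bar) * mu, alpha_bar * s2 + 1 - alpha_bar).
var_t = alpha_bar * s2 + (1.0 - alpha_bar)
score = -(x_t - np.sqrt(alpha_bar) * mu) / var_t

x0_hat = tweedie_x0(x_t, score, alpha_bar)

# Closed-form Gaussian posterior mean, used only to check the Tweedie estimate.
x0_exact = mu + np.sqrt(alpha_bar) * s2 / var_t * (x_t - np.sqrt(alpha_bar) * mu)

# The noise-prediction form should agree with the score form.
eps_hat = -np.sqrt(1.0 - alpha_bar) * score
x0_from_eps = tweedie_x0_from_eps(x_t, eps_hat, alpha_bar)

print(np.allclose(x0_hat, x0_exact), np.allclose(x0_from_eps, x0_hat))
```

In a trained diffusion model the analytic score would be replaced by the network's prediction; the Gaussian case is used here only because the exact posterior mean is available for comparison.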

Paper Structure

This paper contains 43 sections, 11 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Outfitted models generated by OMFA. (a) Person-to-person try-on, (b) garment swapping between persons and (c) multi-pose try-on. Please zoom in to better observe the details.
  • Figure 2: Overview of our proposed OMFA (One Model For All) framework. (a) illustrates the pipeline of person-to-person try-on, including two processes of try-off and try-on in one model. (b) depicts a model design based on the LLM-inspired bidirectional diffusion. The model's inputs are the latent token sequence, with noise added to the person image (try-on stream) or the garment image (try-off stream). (c) presents the multi-pose try-on support of our framework.
  • Figure 3: Qualitative evaluation of virtual try-on on the VITON-HD dataset. OMFA shows a clear advantage in handling person-to-person virtual try-on.
  • Figure 4: Qualitative comparison of multi-pose try-on results with IDM-VTON on DeepFashion-MultiModal. To adapt the input of IDM-VTON, we keep the agnostic mask unchanged and replace the input DensePose representation with the target pose to investigate its capability for pose transfer.
  • Figure 5: Qualitative comparison of TryOffDiff-combined try-on pipelines and our unified framework. Methods combined with TryOffDiff tend to blur patterns, whereas our method better preserves garment details.
  • ...and 11 more figures