Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe, Rita Cucchiara

Abstract

Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction describing the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. Dataset and code will be made publicly available.
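
To make the sample layout in the abstract concrete, the following is a minimal sketch of how one Dress-ED quadruplet plus its instruction might be represented. The class name, field names, and the `edit_kind` helper are assumptions for illustration, not the released dataset schema. Only the five edit types named in the abstract are listed; the remaining two edit types and the three category names are not given there, so they are left open.

```python
from dataclasses import dataclass

# Edit types named as examples in the abstract. The abstract says there are
# seven in total; the other two are not listed, so they are not guessed here.
APPEARANCE_EDITS = {"color", "pattern", "material"}
STRUCTURAL_EDITS = {"sleeve length", "neckline"}


@dataclass(frozen=True)
class DressEDSample:
    """Hypothetical container for one quadruplet plus its edit instruction."""
    garment_path: str          # in-shop garment image
    person_path: str           # person image wearing that garment
    edited_garment_path: str   # edited counterpart of the garment image
    edited_person_path: str    # edited counterpart of the person image
    instruction: str           # natural-language description of the edit
    category: str              # one of three garment categories (names assumed unknown)
    edit_type: str             # one of seven edit types


def edit_kind(edit_type: str) -> str:
    """Group an edit type as 'appearance' or 'structural' where the abstract says so."""
    if edit_type in APPEARANCE_EDITS:
        return "appearance"
    if edit_type in STRUCTURAL_EDITS:
        return "structural"
    return "unknown"  # edit type not named in the abstract


sample = DressEDSample(
    garment_path="garment.jpg",
    person_path="person.jpg",
    edited_garment_path="garment_edited.jpg",
    edited_person_path="person_edited.jpg",
    instruction="Change the sleeves to short sleeves.",
    category="placeholder-category",
    edit_type="sleeve length",
)
print(edit_kind(sample.edit_type))
```

Under this layout, an edited VTON example would pair `person_path` and `instruction` with `edited_person_path` as the target, while an edited VTOFF example would target `edited_garment_path`; this pairing is an interpretation of the abstract rather than a documented loader.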

Paper Structure

This paper contains 25 sections, 8 equations, 10 figures, 5 tables, and 1 algorithm.

Figures (10)

  • Figure 1: We propose the Dress Editing Dataset (Dress-ED), the first benchmark for instruction-driven virtual try-on and try-off with over 146k verified samples across seven editing types, including both appearance and structural modifications.
  • Figure 2: Overview of the Dress-ED curation pipeline. Starting from Dress Code [morelli2022dresscode], (i) structured garment attributes are extracted with Qwen3-VL and used to generate natural-language edit instructions, (ii) edited in-shop garments are synthesized with FLUX.2 Klein, (iii) the corresponding edited try-on images are generated with FitDiT, and (iv) all results are validated by a version of InternVL-3.5 fine-tuned on samples annotated with GPT-5 [openaigpt-5], to ensure semantic and visual consistency.
  • Figure 3: Overview of the proposed Dress-EM architecture for instruction-driven fashion editing.
  • Figure 4: Qualitative results for edited VTON in the unpaired setting (first two rows) and edited VTOFF (last two rows), showing realistic, instruction-consistent edits.
  • Figure 5: Sample label maps, masks, in-shop garment, and FitDiT result.
  • ...and 5 more figures
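
The four curation stages listed in the Figure 2 caption can be sketched as a simple generate-then-verify loop. This is only an illustrative skeleton: the function `curate`, the `Candidate` tuple, and the four stage callables are hypothetical names, and the real models (Qwen3-VL, FLUX.2 Klein, FitDiT, the fine-tuned InternVL-3.5) are represented here by plain functions passed in by the caller rather than by any actual model API.

```python
from typing import Any, Callable, Iterable, Iterator, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One verified result: original pair, edited pair, and the instruction."""
    garment: Any
    person: Any
    edited_garment: Any
    edited_person: Any
    instruction: str


def curate(
    pairs: Iterable[Tuple[Any, Any]],
    propose_instruction: Callable[[Any], str],        # stage (i): attributes -> instruction
    edit_garment: Callable[[Any, str], Any],          # stage (ii): edit the in-shop garment
    try_on: Callable[[Any, Any], Any],                # stage (iii): edited try-on image
    verify: Callable[[Any, Any, Any, str], bool],     # stage (iv): consistency check
) -> Iterator[Candidate]:
    """Yield only the candidates that pass verification (all stages are assumed stubs)."""
    for garment, person in pairs:
        instruction = propose_instruction(garment)
        edited_garment = edit_garment(garment, instruction)
        edited_person = try_on(person, edited_garment)
        if verify(garment, edited_garment, edited_person, instruction):
            yield Candidate(garment, person, edited_garment, edited_person, instruction)


# Toy usage with string stand-ins for images and trivial stage functions.
toy_pairs = [("g1", "p1"), ("g2", "p2")]
kept = list(
    curate(
        toy_pairs,
        propose_instruction=lambda g: "make it red",
        edit_garment=lambda g, instr: g + "+edit",
        try_on=lambda p, eg: p + "|" + eg,
        verify=lambda g, eg, ep, instr: g == "g1",  # pretend only g1 passes
    )
)
print(len(kept))
```

Passing the stages in as callables keeps the sketch independent of any particular model interface; the argument order of each stage is an assumption made for the example.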