BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off

Shuang Liu, Ao Yu, Linkang Cheng, Xiwen Huang, Li Zhao, Junhui Liu, Zhiting Lin, Yu Liu

TL;DR

BridgeDiff is a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. It achieves state-of-the-art performance on standard VTOFF benchmarks, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.

Abstract

Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
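
To make the idea of injecting a structural prior "at selected denoising stages" more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the flat-garment prior is available as a set of token embeddings and that the injection is a gated cross-attention residual. The names `cross_attention`, `constrained_step`, `selected_steps`, and `gate` are hypothetical, and learned projections are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention; the context serves as both keys and values.
    # Learned Q/K/V projections are left out (an assumption made for brevity).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def constrained_step(latent_tokens, structure_tokens, step, selected_steps, gate=0.5):
    # Hypothetical stand-in for a structure-constrained attention layer:
    # the structural prior is added as a gated residual only at selected steps.
    if step not in selected_steps:
        return latent_tokens
    return latent_tokens + gate * cross_attention(latent_tokens, structure_tokens)

rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 8))     # 16 latent tokens, 8 channels (toy sizes)
structure = rng.normal(size=(4, 8))   # 4 structure-prior tokens (toy sizes)
selected = {0, 1, 2}                  # assumed set of steps that receive the prior

out_on = constrained_step(latent, structure, step=1, selected_steps=selected)
out_off = constrained_step(latent, structure, step=7, selected_steps=selected)
```

In this sketch, steps outside `selected_steps` leave the latent tokens untouched, so the prior influences only the chosen stages. Which stages BridgeDiff actually selects, and how FC-Attention is parameterized, is not stated in the abstract.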

Paper Structure (28 sections, 15 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Existing methods often suffer from visual discontinuity and structural instability when relying solely on textual conditioning or weak local constraints, especially under occlusions or partial observations. In contrast, BridgeDiff bridges dressed-person observations and canonical flat-garment representations via a garment-cue representation and explicit flat-structure guidance.
  • Figure 2: Overview of the proposed GCBM. Rather than directly mapping dressed-person observations to flat-garment images, GCBM aggregates multiple visual cues into a garment-cue representation that captures the global appearance and identity of the target garment, supporting visually continuous flat-garment synthesis.
  • Figure 3: Overview of the proposed Flat Structure Constraint for Conditional Diffusion architecture. The framework consists of a trainable model UNet and a largely frozen denoising UNet. To explicitly enforce flat garment layouts, a flat structure constraint module (FSCM) is integrated into the denoising UNet, ensuring stable layout generation without compromising appearance fidelity.
  • Figure 4: Qualitative comparisons on the DressCode dataset. Red circles highlight differences in local regions across different methods. Unmarked examples indicate cases where the overall garment structure or color appearance differs from the reference. Zooming in provides a clearer view of these differences.
  • Figure 5: Qualitative comparisons on the VITON-HD dataset. Red circles highlight differences in local regions across different methods. Unmarked examples indicate cases where the overall garment structure or color appearance differs from the reference. Zooming in provides a clearer view of these differences.
  • ...and 11 more figures