DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder

Ente Lin, Xujie Zhang, Fuwei Zhao, Yuxuan Luo, Xin Dong, Long Zeng, Xiaodan Liang

TL;DR

DreamFit introduces a lightweight, garment-centric human generation framework by replacing bulky garment encoders with a LoRA-based Anything-Dressing Encoder embedded in a frozen diffusion UNet and guided by an adaptive attention mechanism. By coupling this with LMM-powered prompt enrichment during inference, DreamFit narrows the training–inference prompt gap and preserves texture fidelity with only $83.4$M trainable parameters, outperforming state-of-the-art baselines on open and internal datasets at $768\times512$. The approach maintains plug-and-play compatibility with community control plugins and scales to SDXL and FLUX architectures, offering strong generalization across diverse garments and styles. Overall, DreamFit achieves text and texture consistency with high-quality garment details while significantly improving training efficiency and accessibility for garment-centric diffusion generation.

Abstract

Diffusion models for garment-centric human generation from text or image prompts have attracted growing attention for their application potential. However, existing methods face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation. DreamFit has three key advantages: (1) Lightweight training: with the proposed adaptive attention and LoRA modules, DreamFit reduces the model complexity to 83.4M trainable parameters. (2) Anything-Dressing: our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) Plug-and-play: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both $768 \times 512$ high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities in garment-centric human generation.
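To give a rough sense of why LoRA modules keep the trainable parameter count small, the sketch below counts the parameters a single low-rank adapter adds to one linear projection. It uses the standard LoRA factorization (a rank-$r$ pair of matrices beside a frozen weight). The layer width and rank are hypothetical values chosen for illustration and are not taken from the paper, and the calculation does not attempt to reproduce the 83.4M total reported above.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    The adapter learns A (rank x d_in) and B (d_out x rank);
    the original d_out x d_in weight stays frozen.
    """
    return rank * (d_in + d_out)


def full_linear_params(d_in: int, d_out: int) -> int:
    """Parameters trained when the whole weight matrix is finetuned (bias ignored)."""
    return d_in * d_out


# Hypothetical projection size and rank, for illustration only (not from the paper).
d_in = d_out = 1280
rank = 64

lora = lora_trainable_params(d_in, d_out, rank)
full = full_linear_params(d_in, d_out)
ratio = lora / full

print(f"LoRA adapter: {lora:,} params; full finetune: {full:,} params; ratio: {ratio:.2%}")
```

With these assumed sizes the adapter trains one tenth of the parameters that full finetuning of the same layer would; smaller ranks shrink that fraction proportionally.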

Paper Structure

This paper contains 28 sections, 7 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Garment-centric human generation results of DreamFit. Top: DreamFit can synthesize human images with varied styles, backgrounds, and body shapes that comply with the given clothing image and prompt. Middle: DreamFit is compatible with community plugins such as ControlNet [zhang2023adding] and FaceID [ye2023ip]. Bottom: DreamFit demonstrates superior performance compared to SOTA methods, achieving the highest levels of texture and text consistency.
  • Figure 2: Performance comparison between baselines and DreamFit. The circle size represents the number of trainable parameters, with larger circles indicating a higher parameter count. Higher CLIP-I and CLIP-T scores signify better alignment between the generated results and text descriptions. Our method not only achieves the best performance but also uses far fewer trainable parameters.
  • Figure 3: Overview of DreamFit. Our method constructs an Anything-Dressing Encoder using LoRA layers. The reference image features are extracted by the Anything-Dressing Encoder and then passed into the denoising UNet via adaptive attention. Furthermore, we incorporate Large Multimodal Models (LMMs) into the inference process to reduce the text prompt gap between training and testing.
  • Figure 4: Qualitative comparison on the open and internal benchmarks. DreamFit demonstrates a distinct advantage in handling complex patterns and text. Please zoom in for more details.
  • Figure 5: Plug-and-play results of DreamFit. Our method integrates seamlessly with community conditional control plugins.
  • ...and 10 more figures