E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

Trung X. Pham, Zhang Kang, Ji Woo Hong, Xuran Zheng, Chang D. Yoo

TL;DR

E-MD3C introduces a lightweight masked diffusion transformer for zero-shot object customization, operating on latent patches to dramatically reduce parameters and compute compared with Unet-based latent diffusion. The method decouples conditioning into a Disentangled Masked Diffusion Module and a Learnable Conditions Collector (CCNet), enabling efficient denoising with two branches and a compact conditional vector. It combines a Denoising Transformer-based Diffusion Network (DTDNet) with a dynamically guided, disentangled conditioning scheme, achieving competitive quality on VITON-HD while delivering up to 2.5x faster inference and 1/3 lower memory usage. Extensive experiments and ablations demonstrate robustness across views and scenarios, highlighting practical impact for real-world, resource-constrained applications in zero-shot object customization.

Abstract

We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only $\frac{1}{4}$ of the parameters, our Transformer-based 468M model delivers $2.5\times$ faster inference and uses $\frac{2}{3}$ of the GPU memory compared to a 1720M Unet-based latent diffusion model.
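The efficiency figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (468M vs. 1720M parameters, $2.5\times$ speed-up, $\frac{2}{3}$ of the memory); the variable names are illustrative and not taken from the paper.

```python
from fractions import Fraction

# Figures quoted in the abstract (parameter counts in millions).
ours_params_m = 468
baseline_params_m = 1720

# Parameter ratio: the abstract rounds this to "1/4".
param_ratio = ours_params_m / baseline_params_m

# Using 2/3 of the memory is the same as a 1/3 reduction.
memory_fraction = Fraction(2, 3)
memory_saving = 1 - memory_fraction

# A 2.5x speed-up means each image takes 1/2.5 of the baseline time.
speedup = 2.5
relative_latency = 1 / speedup

print(f"parameter ratio:  {param_ratio:.3f}")
print(f"memory saving:    {memory_saving}")
print(f"relative latency: {relative_latency:.2f}")
```

The parameter ratio comes out slightly above one quarter, so the "$\frac{1}{4}$" in the abstract should be read as a rounded figure.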

Paper Structure

This paper contains 33 sections, 8 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Existing Approach Inefficiency. The current model (black) demands significant parameters, memory, and inference time due to its resource-intensive architecture.
  • Figure 2: Object Composition. The 3$^{\text{rd}}$ and 4$^{\text{th}}$ columns show outputs from the existing method and our model. Our model generates images in just over 2 seconds, compared to 7 seconds for the existing approach.
  • Figure 3: Zero-shot object customization and its practical applications. Images are generated using our E-MD3C model.
  • Figure 4: Overview of the E-MD3C framework for zero-shot object customization. During training, 30% of patched tokens are masked, and the noisy input is processed by the Diffusion Transformer, conditioned on a collected vector ($D=1024$) via AdaLN modulation (Peebles & Xie, 2023). A mask prediction objective models token relationships. The red arrow ($\color{red}\rightarrow$) is training-only, the black arrow ($\color{black}\rightarrow$) is used for both training and inference, and the green arrow ($\color{green}\rightarrow$) is inference-only.
  • Figure 5: Training data with diverse object sizes. In the pixel space ($512 \times 512$), the model is trained on objects of varying sizes and positions, with black areas marking the masked objects inside their bounding boxes. In the latent space ($64 \times 64$), the box position is preserved.
  • ...and 9 more figures
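Two mechanics mentioned in the captions of Figures 4 and 5 can be illustrated with a small sketch: mapping a bounding box from the $512 \times 512$ pixel space to the $64 \times 64$ latent space, and randomly masking 30% of the patch tokens during training. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the floor/ceil rounding rule, and the patch size of 2 (which gives $32 \times 32 = 1024$ tokens) are hypothetical choices made for the example.

```python
import random


def pixel_box_to_latent(box, pixel_size=512, latent_size=64):
    """Map an (x0, y0, x1, y1) pixel box to latent-grid coordinates.

    Assumption: the latent grid is a uniform downscaling of the image,
    so a 512 -> 64 mapping divides every coordinate by 8.
    """
    scale = pixel_size // latent_size
    x0, y0, x1, y1 = box
    # Floor the top-left corner and ceil the bottom-right corner so the
    # latent box still covers the whole pixel box.
    return (x0 // scale, y0 // scale, -(-x1 // scale), -(-y1 // scale))


def sample_token_mask(num_tokens, mask_ratio=0.3, seed=0):
    """Return a list of booleans; True marks a token hidden during training."""
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


# Hypothetical example: a box in pixel space and a 64x64 latent split
# into 2x2 patches (the patch size is an assumption, not from the source).
latent_box = pixel_box_to_latent((128, 96, 384, 400))
tokens_per_side = 64 // 2
mask = sample_token_mask(tokens_per_side * tokens_per_side)

print(latent_box)
print(sum(mask), "of", len(mask), "tokens masked")
```

Because the scaling is uniform, the relative position of the box is unchanged, which matches the caption's remark that the box position is preserved in the latent space. At inference time no tokens would be masked, consistent with the training-only path in Figure 4.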