Cross-view Masked Diffusion Transformers for Person Image Synthesis

Trung X. Pham, Zhang Kang, Chang D. Yoo

TL;DR

X-MDPT introduces a cross-view masked diffusion transformer for pose-guided human image generation (PHIG), shifting from Unet to latent-patch transformers and adding CANet for unified conditioning and MIPNet for cross-view mask prediction. The model achieves state-of-the-art results on DeepFashion with a compact 33MB footprint and faster inference than pixel-based methods, while maintaining high visual fidelity and view consistency across poses. Through CANet, MIPNet, and classifier-free guidance (CFG), the approach fuses pose, local source, and global source information and learns cross-view correspondences, enabling stable, view-invariant generation. The work highlights the practicality of diffusion transformers for PHIG, offering efficiency, scalability, and strong empirical performance.

Abstract

We present X-MDPT ($\underline{Cross}$-view $\underline{M}$asked $\underline{D}$iffusion $\underline{P}$rediction $\underline{T}$ransformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly used Unet structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while being efficient in terms of trainable parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) while using $11\times$ fewer parameters. Our best model surpasses the pixel-based diffusion with $\frac{2}{3}$ of the parameters and achieves $5.43 \times$ faster inference. The code is available at https://github.com/trungpx/xmdpt.
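
To make the "single conditioning vector" idea concrete, here is a minimal, dependency-free sketch of adaptive layer norm (AdaLN) modulation, which the Figure 3 caption names as the conditioning mechanism. It is an illustration under stated assumptions, not the authors' implementation: the function names (`layer_norm`, `matvec`, `adaln_modulate`), the toy dimensions, and the plain linear projections `w_scale` and `w_shift` are all hypothetical, and the `(1 + scale)` form follows the common diffusion-transformer convention, which may differ in detail from X-MDPT.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def adaln_modulate(tokens, cond, w_scale, w_shift):
    """Modulate every normalized token with a scale and shift derived from `cond`.

    `cond` stands in for the aggregated condition vector; `w_scale` and
    `w_shift` are hypothetical projections (assumptions, not from the paper).
    """
    scale = matvec(w_scale, cond)
    shift = matvec(w_shift, cond)
    return [
        [n * (1.0 + g) + b for n, g, b in zip(layer_norm(token), scale, shift)]
        for token in tokens
    ]


# Toy example: one 4-dimensional token and a 2-dimensional condition vector.
tokens = [[1.0, 2.0, 3.0, 4.0]]
cond = [1.0, 0.0]
w_scale = [[0.0, 0.0] for _ in range(4)]   # zero scale: normalization is unchanged
w_shift = [[0.5, 0.0] for _ in range(4)]   # constant shift of 0.5 on every channel
modulated = adaln_modulate(tokens, cond, w_scale, w_shift)
```

With a zero scale projection the output is the normalized token plus the shift, so changing only `cond` changes how every token in the sequence is modulated. That is why a single aggregated vector is enough to steer the whole denoising network in this style of conditioning.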

Paper Structure (24 sections, 6 equations, 28 figures, 5 tables)


Figures (28)

  • Figure 1: FID scores of SOTA approaches on the DeepFashion dataset. Our transformer-based models, X-MDPT (sizes S, B, L), are marked with stars. X-MDPT-S surpasses the latent Unet-based PoCoLD with $11\times$ fewer parameters.
  • Figure 2: Source-View Invariance. The 2$^{\text{nd}}$ and 7$^{\text{th}}$ columns display different views of the same individuals from DeepFashion. PIDM yields inconsistent outputs when the source image view varies, whereas ours produces consistent outputs that are closer to the ground truth. Best viewed at 200% zoom.
  • Figure 3: Overview of our X-MDPT framework, which is built on transformers and performs pose-guided human image generation. During training, we randomly mask target image tokens at a 30% ratio. The noisy target image is then processed by the Transformer Diffusion network, conditioned on the aggregated vector (with $D=768$ for our X-MDPT-B model) via AdaLN modulation [peebles2023scalable]. Concurrently, we train a mask prediction objective with our novel mask inter-prediction network to capture semantics between the source and target images when predicting masked tokens. The red arrow $\color{red}\rightarrow$ marks the training-only branch, which is discarded during inference, while the blue arrow $\color{blue}\rightarrow$ is used in both training and inference. "Cond. Agg." (bottom right) denotes "Conditions Aggregation". The VAE is omitted for simplicity.
  • Figure 4: MIPNet vs. MDT. Our MIPNet predicts masked tokens by using all tokens from the reference image $x_s$.
  • Figure 5: CANet. Views of the same person have 99.99% similarity.
  • ...and 23 more figures
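
The training-time masking described in the Figure 3 caption (target image tokens masked at random with a 30% ratio) can be sketched as follows. This is a minimal illustration only: the helper name `sample_token_mask`, the token count of 256, and the use of a seeded `random.Random` are assumptions made for the example and are not taken from the paper or its code.

```python
import random


def sample_token_mask(num_tokens, mask_ratio=0.3, rng=None):
    """Pick which token positions to mask; return (masked_indices, kept_indices).

    `mask_ratio=0.3` matches the 30% ratio in the Figure 3 caption;
    everything else here is a hypothetical choice for illustration.
    """
    rng = rng or random.Random()
    num_masked = int(round(num_tokens * mask_ratio))
    # Sample distinct positions uniformly at random, without replacement.
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    kept = [i for i in range(num_tokens) if i not in masked_set]
    return masked, kept


# Assumed sequence length of 256 latent-patch tokens; seeded for reproducibility.
masked, kept = sample_token_mask(256, mask_ratio=0.3, rng=random.Random(0))
print(len(masked), len(kept))
```

In a masked diffusion transformer, only the kept tokens would typically pass through the main denoising path, while a separate branch is trained to predict the masked ones. Per the caption, that prediction branch is used during training only and is discarded at inference.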