Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe

TL;DR

This work tackles the loss of detail in pose-guided person image synthesis caused by latent diffusion model (LDM) compression, especially in faces and garments. It introduces Multi-focal Conditioned Latent Diffusion (MCLD), a two-branch architecture whose Multi-focal Condition Aggregation (MFCA) module fuses pose-invariant cues from the face, texture maps, and appearance into the UNet via selective cross-attention. By combining the embeddings $I_{emb}$, $F_{emb}$, and $A_{emb}$ with a ReferenceNet for appearance and DensePose for pose guidance, the method achieves stronger identity preservation and texture fidelity on the DeepFashion dataset and enables flexible editing of the generated images. Training uses losses such as $\mathcal{L}_{mse}$ and $\mathcal{L}_{face}$ to steer refinement toward sensitive regions.
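To make the idea of conditioning through cross-attention concrete, the following is a minimal sketch and not the authors' MFCA module. It assumes the three focal embeddings are simply concatenated along the token axis and then attended to by UNet feature tokens with plain scaled dot-product attention; the learned projections and the selective routing described in the paper are omitted. The function names `aggregate_conditions` and `cross_attention` and all numeric values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_conditions(image_emb, face_emb, appearance_emb):
    """Assumed stand-in for condition aggregation: concatenate the
    token lists of the three embeddings into one context sequence."""
    return image_emb + face_emb + appearance_emb


def cross_attention(queries, context):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of N feature tokens (each a list of floats)
    context: list of M condition tokens of the same dimension
    Returns the attended outputs and the attention weights.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        w = softmax(scores)
        out = [sum(w[j] * context[j][d] for j in range(len(context)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy embeddings (hypothetical values): one image token, one face token,
# two appearance tokens, all of dimension 4.
image_emb = [[1.0, 0.0, 0.0, 0.0]]
face_emb = [[0.0, 1.0, 0.0, 0.0]]
appearance_emb = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

c_emb = aggregate_conditions(image_emb, face_emb, appearance_emb)

# Two UNet feature tokens: the first aligns with the image token,
# the second with the face token.
unet_tokens = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]]
fused, attn = cross_attention(unet_tokens, c_emb)
```

In this toy setup each feature token places most of its attention on the condition token it aligns with, which is the mechanism that lets a face-specific embedding influence face-related features more strongly than the rest.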

Abstract

The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often degrades details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce images with realistic appearance and consistent identity. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and, because of this consistency, enables flexible person image editing. The code is available at https://github.com/jqliu09/mcld.
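The scale of the compression mentioned above can be illustrated with a short calculation. This sketch assumes a Stable Diffusion style autoencoder with a spatial downsampling factor of 8 and 4 latent channels, and a hypothetical 512x512 RGB input; the text here does not state the autoencoder configuration or the image resolution, so these numbers are assumptions chosen for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for an image of the
    given size, assuming the sides are divisible by the downsampling factor."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3,
                      downsample=8, latent_channels=4):
    """Number of image values divided by the number of latent values."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


# Hypothetical 512x512 RGB input (assumed, not taken from the paper).
shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions every latent value has to summarize 48 image values, so a face that occupies a small part of the frame is represented by only a handful of latent positions. This is one way to see why fine facial and texture detail is fragile under latent-space generation.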

Paper Structure

This paper contains 13 sections, 7 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: (a) VAE reconstruction (Rombach et al., 2022) degrades the fine details of person images, especially facial regions and complex textures. The degradation worsens when the latent deviates slightly from the encoded one, as generated latents typically do; a small deviation $\epsilon=0.2$ is added to illustrate this common case. (b) By introducing multi-focal conditions, our method preserves these details better than other LDM-based methods.
  • Figure 2: The overall pipeline of our proposed Multi-focal Conditioned Diffusion Model. (a) Face regions and appearance regions are first extracted from the source person image; (b) a multi-focal condition aggregation module $\phi$ fuses the focal embeddings into $c_{emb}$; (c) ReferenceNet $\mathcal{R}$ aggregates information from the appearance texture map, denoted $c_{ref}$; (d) DensePose provides pose control, which the Pose Guider fuses into the UNet together with the noise.
  • Figure 3: Qualitative comparison with several state-of-the-art models on the DeepFashion dataset. The inputs to our model are the target pose $p_{t}$ and the source person image $\mathcal{I}$. From left to right: results of NTED, CASD, PIDM, CFLD, and ours.
  • Figure 4: Qualitative ablation comparison. Refer to the ablation table for baseline settings.
  • Figure 5: Appearance editing results. Our method supports flexible editing of given identities, poses, and clothes. This is achieved solely by modifying regions of the conditions, without any masking or further training.
  • ...and 12 more figures