Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Jiaqi Liu; Jichao Zhang; Paolo Rota; Nicu Sebe

TL;DR

This work tackles the loss of detail in pose-guided person image synthesis that results from the compression step of latent diffusion models (LDMs), which is most visible in faces and garments. It introduces Multi-focal Conditioned Latent Diffusion (MCLD), a two-branch architecture whose Multi-focal Condition Aggregation (MFCA) module fuses pose-invariant cues from the face, texture maps, and appearance into the UNet via selective cross-attention. The method combines the embeddings $I_{emb}$, $F_{emb}$, and $A_{emb}$ with a ReferenceNet for appearance and DensePose for pose guidance, and relies on losses such as $\mathcal{L}_{mse}$ and $\mathcal{L}_{face}$ to steer refinement toward sensitive regions. On the DeepFashion dataset it achieves stronger identity preservation and texture fidelity, and its generation consistency enables flexible person image editing.
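To make the idea of fusing several condition embeddings into UNet features through cross-attention more concrete, the following is a minimal NumPy sketch. It is not the authors' MFCA module: the token counts, the feature width `d`, the random projection matrices, and the plain concatenation of the three embeddings are all assumptions made for illustration, and the "selective" part of the attention described above is not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) feature tokens (here: stand-ins for UNet features)
    context: (n_c, d) condition tokens to attend over
    Returns the attended output (n_q, d) and the attention weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # hypothetical feature width

# Random stand-ins for UNet feature tokens and for the three condition
# embeddings named in the text. All shapes are assumptions.
unet_tokens = rng.normal(size=(16, d))
i_emb = rng.normal(size=(4, d))
f_emb = rng.normal(size=(1, d))
a_emb = rng.normal(size=(4, d))

# Assumption: the conditions are simply concatenated along the token axis.
context = np.concatenate([i_emb, f_emb, a_emb], axis=0)

# Hypothetical projection matrices (learned in a real model).
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

attended, weights = cross_attention(unet_tokens, context, w_q, w_k, w_v)

# Residual connection, as is common in attention blocks.
fused = unet_tokens + attended
```

Each row of `weights` shows how much one feature token draws from each condition token, so the columns can be grouped by source embedding to inspect which condition dominates at a given location.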

Abstract

The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce images that are realistic in appearance and consistent in identity. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.
