Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis
Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen, Fei Shen, Xiaoyu Du
TL;DR
Pose-guided image synthesis often suffers from incomplete textures when conditioning on a single view, and it lacks explicit cross-view interaction. The paper introduces the Jointly Conditioned Diffusion Model (JCDM), which combines an Appearance Prior Module (APM) that predicts a holistic, identity-preserving prior from sparse multi-view inputs with a Joint Conditional Injection (JCI) mechanism that fuses multi-view cues into the denoising backbone via cross-view interaction. The two modules are designed as plug-and-play components compatible with standard diffusion backbones and are trained with dual objectives, achieving state-of-the-art fidelity and cross-view consistency on DeepFashion and an in-house video dataset. JCDM supports a variable number of reference views and enables single-pass multi-view synthesis with reduced latency, making it practical for real-world content creation and virtual avatar applications.
Abstract
Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present the Jointly Conditioned Diffusion Model (JCDM), a diffusion framework that exploits multi-view priors. The Appearance Prior Module (APM) infers a holistic, identity-preserving prior from incomplete references, and the Joint Conditional Injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones through minimal, targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.
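
To make the idea of fusing a variable number of reference views into one shared conditioning signal more concrete, the following is a minimal illustrative sketch, not the JCI mechanism itself. It assumes each reference view has already been encoded as a fixed-length feature vector and uses plain scaled dot-product attention pooling as a stand-in for the fusion step. The names `fuse_views`, `query`, and `view_features` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(query, view_features):
    """Attention-pool any number of per-view feature vectors into one vector.

    query: list[float] of length d (hypothetical target-pose query; an assumption)
    view_features: list of list[float], one length-d vector per reference view
    Returns (fused, weights): the fused length-d vector and the per-view weights.
    """
    if not view_features:
        raise ValueError("at least one reference view is required")
    d = len(query)
    scale = math.sqrt(d)
    # One similarity score per reference view (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, feat)) / scale
              for feat in view_features]
    weights = softmax(scores)
    # Weighted sum of the view features; the output size does not depend
    # on how many views were supplied.
    fused = [sum(w * feat[i] for w, feat in zip(weights, view_features))
             for i in range(d)]
    return fused, weights


query = [1.0, 0.0]
one_view = [[0.5, 0.5]]
three_views = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

fused1, w1 = fuse_views(query, one_view)
fused3, w3 = fuse_views(query, three_views)
```

Because the weights are normalized over whatever views are present, the same function accepts one view or several and always returns a vector of the same length, which is the property that lets a fixed denoising backbone consume a variable number of references. How JCDM actually performs this fusion and injection is described in the paper, not here.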
