Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis

Chengyu Xie, Zhi Gong, Junchi Ren, Linkun Yu, Si Shen, Fei Shen, Xiaoyu Du

TL;DR

Pose-guided image synthesis often suffers from incomplete textures when conditioned on a single view, and it lacks explicit cross-view interaction. The paper introduces the Jointly Conditioned Diffusion Model (JCDM), which combines two components: an Appearance Prior Module (APM) that predicts a holistic, identity-preserving prior from sparse multi-view inputs, and a Joint Conditional Injection (JCI) mechanism that fuses multi-view cues into the denoising backbone via cross-view interaction. Both are designed as plug-and-play components compatible with standard diffusion backbones and are trained with dual objectives. The paper reports state-of-the-art fidelity and cross-view consistency on DeepFashion and an in-house video dataset. JCDM supports a variable number of reference views and enables single-pass multi-view synthesis with reduced latency, which suits content creation and virtual avatar applications.

Abstract

Pose-guided human image generation is limited by incomplete textures from single reference views and by the absence of explicit cross-view interaction. We present the jointly conditioned diffusion model (JCDM), a diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic, identity-preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones through minimal, targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.

Paper Structure

This paper contains 12 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of the proposed JCDM. JCI encodes multiple reference views and target poses into a unified nine-channel latent for the denoising UNet. APM infers an identity-preserving appearance prior from the same inputs. The prior and the latent jointly condition the UNet to render consistent, high-fidelity images across poses.
  • Figure 2: Qualitative comparison with SOTA methods.
  • Figure 3: User study results.
  • Figure 4: Learning curve of JCDM over 100k steps. Accuracy, measured by average cosine similarity, improves steadily and converges.
  • Figure 5: Ablation study results, showing the progressive visual improvements from baseline (B0) to JCDM (Ours).
  • ...and 2 more figures
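
The caption of Figure 4 reports accuracy as an average cosine similarity. The sketch below shows one way such a score could be computed. It is illustrative only: the listing does not say which features are compared, so the pairing of generated and reference feature vectors, and the function names `cosine_similarity` and `average_cosine_similarity`, are assumptions rather than the authors' evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def average_cosine_similarity(generated, reference):
    """Mean cosine similarity over paired feature vectors.

    Assumption: generated[i] and reference[i] are feature vectors
    (hypothetical here) describing the same subject and pose.
    """
    if not generated or len(generated) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(g, r) for g, r in zip(generated, reference)]
    return sum(scores) / len(scores)
```

For example, a pair of identical vectors scores 1 and a pair of orthogonal vectors scores 0, so averaging over those two pairs gives 0.5.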