Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

Donghwan Lee, Kyungha Min, Kirok Kim, Seyoung Jeong, Jiwoo Jeong, Wooju Kim

TL;DR

Diffusion-based pose-guided person image synthesis often struggles to transfer semantic information from a source image to a target pose when everything is learned in a single-stage conditioning pipeline. FPDM addresses this in two stages: the first learns an explicit fusion embedding of the source image and target pose that is aligned with the target image embedding, and the second uses this embedding as the condition for a latent diffusion model. The method reports state-of-the-art performance on DeepFashion and RWTH-PHOENIX-Weather 2014T; ablations show that a model using only the second stage already comes close to other SOTA methods and that the Source-Enhanced Pose Fusion variant improves robustness. The results indicate FPDM’s potential for high-fidelity, pose-consistent person image synthesis and its applicability to sign-language video generation, while fine-grained texture transfer remains an area for further refinement.

Abstract

Pose-Guided Person Image Synthesis (PGPIS) aims to synthesize high-quality person images corresponding to target poses while preserving the appearance of the source image. Recently, PGPIS methods that use diffusion models have achieved competitive performance. Most approaches involve extracting representations of the target pose and source image and learning their relationships in the generative model's training process. This approach makes it difficult to learn the semantic relationships between the input and target images and complicates the model structure needed to enhance generation results. To address these issues, we propose Fusion embedding for PGPIS using a Diffusion Model (FPDM). Inspired by the successful application of pre-trained CLIP models in text-to-image diffusion models, our method consists of two stages. The first stage involves training the fusion embedding of the source image and target pose to align with the target image's embedding. In the second stage, the generative model uses this fusion embedding as a condition to generate the target image. We applied the proposed method to the benchmark datasets DeepFashion and RWTH-PHOENIX-Weather 2014T, and conducted both quantitative and qualitative evaluations, demonstrating state-of-the-art (SOTA) performance. An ablation study of the model structure showed that even a model using only the second stage achieved performance close to the other PGPIS SOTA models. The code is available at https://github.com/dhlee-work/FPDM.
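
The first-stage idea described in the abstract, training a fusion embedding so that it aligns with the target image's embedding, can be illustrated with a small sketch. The abstract does not state the exact training objective, so the symmetric CLIP-style contrastive loss below is an assumption chosen because the method is described as inspired by CLIP. The function names (`cosine_similarity`, `alignment_loss`, `rank_by_similarity`) and the temperature value of 0.07 are hypothetical and are not taken from the FPDM code base. Embeddings are plain Python lists here; in practice they would come from trained encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(fusion_batch, target_batch, temperature=0.07):
    """Symmetric contrastive loss (assumed objective, not confirmed by the source).

    fusion_batch[i] is the fusion embedding of (source image, target pose) i,
    and target_batch[i] is the embedding of the matching target image.
    """
    logits = [
        [cosine_similarity(f, t) / temperature for t in target_batch]
        for f in fusion_batch
    ]
    logits_t = [list(col) for col in zip(*logits)]  # target-to-fusion direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def rank_by_similarity(fusion, targets):
    """Indices of candidate target embeddings, most similar first."""
    scores = [cosine_similarity(fusion, t) for t in targets]
    return sorted(range(len(targets)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    print("loss (aligned):   ", alignment_loss(aligned, targets))
    print("loss (misaligned):", alignment_loss(aligned[::-1], targets))
    print("ranking for first fusion embedding:", rank_by_similarity(aligned[0], targets))
```

A loss of this form is low when each fusion embedding is closest to its own target embedding and high when the pairs are mismatched. The same cosine similarity can also be used to rank candidate target images for a given fusion embedding, which is one way to inspect how well the two embedding spaces agree.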

Paper Structure

This paper contains 22 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Architecture of the Fusion Embedding for PGPIS with Diffusion Model.
  • Figure 2: Qualitative comparisons with current state-of-the-art models on the DeepFashion dataset. We highlighted areas of interest with red boxes to enhance visual understanding.
  • Figure 3: Visualization of the robustness of generated images to input variations.
  • Figure 4: Qualitative evaluation of the first-stage ablation: cosine similarity-based ranking on the test set.
  • Figure 5: Qualitative comparisons of second-stage ablation results. Areas of interest are highlighted with red boxes to enhance visual understanding.
  • ...and 2 more figures