RePoseDM: Recurrent Pose Alignment and Gradient Guidance for Pose Guided Image Synthesis

Anant Khandelwal

TL;DR

This work introduces RePoseDM, a diffusion-based framework for pose-guided image synthesis that achieves photorealistic rendering by combining Recurrent Pose Alignment with gradient guidance from Pose Interaction Fields. The recurrent module warps source appearance features across pose-conditioned iterations, while the gradient guidance constrains predicted poses to stay on valid manifolds, reducing pose leakage and alignment errors. Extensive experiments on DeepFashion, Market-1501, and HumanArt show state-of-the-art quantitative and perceptual performance, with ablations confirming the contributions of both recurrent alignment and gradient guidance. The approach also enhances downstream tasks like person re-identification through pose-consistent data augmentation, highlighting practical impact for virtual try-on, AR, and related applications.

Abstract

The pose-guided person image synthesis task requires re-rendering a reference image with a photorealistic appearance and flawless pose transfer. Since person images are highly structured, existing approaches require dense connections to handle complex deformations and occlusions, which are generally addressed through multi-level warping and masking in latent space. The feature maps generated by convolutional neural networks are not equivariant, and hence multi-level warping is required to perform pose alignment. Inspired by the ability of diffusion models to generate photorealistic images from given conditional guidance, we propose recurrent pose alignment to provide pose-aligned texture features as conditional guidance. Because the source pose leaks into the conditional guidance, we propose gradient guidance from pose interaction fields, which take a predicted pose as input and output its distance from the valid pose manifold. This helps in learning plausible pose transfer trajectories that result in photorealism and undistorted texture details. Extensive results on two large-scale benchmarks and a user study demonstrate the ability of our proposed approach to generate photorealistic pose transfer under challenging scenarios. Additionally, we demonstrate the efficiency of gradient guidance in pose-guided image generation on the HumanArt dataset with fine-tuned Stable Diffusion.
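
The idea of gradient guidance from a distance field can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a 2-D "pose" and uses the unit circle as a stand-in for the valid pose manifold, and every name in it (`distance_to_manifold`, `guidance_energy`, `guide`) is hypothetical. A central-difference gradient stands in for automatic differentiation. It only shows how following the negative gradient of a distance-based energy pulls a prediction back toward the manifold.

```python
import math


def distance_to_manifold(p):
    """Toy distance field: distance of a 2-D point from the unit circle.

    Assumption: the unit circle stands in for the valid pose manifold.
    """
    return abs(math.hypot(p[0], p[1]) - 1.0)


def guidance_energy(p):
    """Half the squared distance, so the gradient shrinks near the manifold."""
    d = distance_to_manifold(p)
    return 0.5 * d * d


def numerical_gradient(f, p, eps=1e-5):
    """Central-difference gradient of f at p (stand-in for autograd)."""
    grad = []
    for i in range(len(p)):
        hi = list(p)
        lo = list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def guide(p, step=0.5, iters=50):
    """Repeatedly move p along the negative gradient of the energy."""
    p = list(p)
    for _ in range(iters):
        g = numerical_gradient(guidance_energy, p)
        p = [x - step * gi for x, gi in zip(p, g)]
    return p


start = [2.0, 0.5]               # an "invalid" prediction off the manifold
guided = guide(start)
print(distance_to_manifold(start), distance_to_manifold(guided))
```

In a diffusion sampler, a correction of this kind would be applied to the intermediate prediction at each denoising step; how RePoseDM injects the gradient is described in the paper itself and is not reproduced here.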

Paper Structure

This paper contains 18 sections, 17 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: RePoseDM: Architecture of our proposed U-Net based Diffusion Model with Recurrent Pose Alignment and Gradient Guidance from Pose Interaction Fields. Warped source appearance features are fed to U-Net using cross-attention.
  • Figure 2: Qualitative comparison of several SOTA methods on the DeepFashion dataset. The inputs are the target pose and the source image; the ground truth shows the image in the target pose. Images generated by each method follow. "Ours" denotes RePoseDM.
  • Figure 3: Qualitative comparison of several SOTA methods on the Market-1501 dataset. "Ours" denotes RePoseDM.
  • Figure 4: Editing capability of RePoseDM: controlling the garment appearance of the target image using the source image.
  • Figure 5: Qualitative comparison with ablation baselines B1 and B2.
  • ...and 6 more figures