DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images
Enbo Huang, Yuan Zhang, Faliang Huang, Guangyu Zhang, Yang Liu
TL;DR
This work introduces DRDM, a diffusion-based framework for controllable person image synthesis that decouples structure and texture across body parts to robustly transfer pose and appearance from a source image. Key innovations include the body-part subspace decoupling block (BSDB) with self-attention, a pose encoder that maps pose features into a high-dimensional latent space for guidance, and a parsing-map-driven, classifier-free diffusion sampling strategy (PMDCF) that reinforces texture and pose conditioning. The approach yields state-of-the-art results on the DeepFashion dataset, with superior SSIM and LPIPS, competitive FID, and favorable perceptual user-study outcomes, while mitigating occlusion, limb distortion, and garment style deviation. The combination of disentangled texture fusion and parsing-map-based sampling provides precise control over appearance and pose, offering practical benefits for virtual try-on, editing, and video production.
Abstract
Person image synthesis with controllable body pose and appearance is an essential task, driven by practical needs in virtual try-on, image editing, and video production. However, existing methods still suffer from missing details, limb distortion, and garment style deviation. To address these issues, we propose a Disentangled Representations Diffusion Model (DRDM) that generates photo-realistic images from source portraits in the desired poses and appearances. First, a pose encoder maps pose features into a high-dimensional space to guide the generation of person images. Second, a body-part subspace decoupling block (BSDB) disentangles the features of the different body parts of a source figure and feeds them to multiple layers of the noise prediction block, supplying the network with rich disentangled features for generating a realistic target image. Moreover, for inference, we develop a parsing-map-based disentangled classifier-free guided sampling method that amplifies the conditional signals of texture and pose. Extensive experiments on the DeepFashion dataset demonstrate the effectiveness of our approach in achieving pose transfer and appearance control.
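
To illustrate what amplifying the texture and pose signals could look like at sampling time, the following is a minimal sketch of classifier-free guidance with separate scales for two conditions. It is a generic compositional form, not the formulation used by DRDM: the function name `disentangled_cfg`, the weights `w_pose` and `w_texture`, and the assumption that the network yields one unconditional and two single-condition noise predictions are all hypothetical, and the role of the parsing map is not modeled here.

```python
import numpy as np


def disentangled_cfg(eps_uncond, eps_pose, eps_texture, w_pose=2.0, w_texture=2.0):
    """Combine noise predictions using one guidance scale per condition.

    Hypothetical sketch: each conditional prediction is assumed to come from
    the same noise-prediction network with only that condition enabled.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_pose = np.asarray(eps_pose, dtype=float)
    eps_texture = np.asarray(eps_texture, dtype=float)

    # Start from the unconditional prediction and push it along each
    # condition's direction, scaled by that condition's guidance weight.
    return (
        eps_uncond
        + w_pose * (eps_pose - eps_uncond)
        + w_texture * (eps_texture - eps_uncond)
    )
```

Setting both weights to zero recovers the unconditional prediction, while raising one weight strengthens only the corresponding condition, which is the sense in which the two signals are controlled independently in this sketch.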
