Boost Your Human Image Generation Model via Direct Preference Optimization

Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee

TL;DR

HG-DPO tackles the realism gap in human image generation by reframing Direct Preference Optimization to use real images as winning examples and integrating a three-stage curriculum. A new image-pool strategy, SDRecon-based intermediate domains, and a statistics-matching loss bridge the domain gap between real and generated imagery, enabling higher realism and better image-text alignment. The method also supports personalized text-to-image generation via LoRA-enabled adaptation, maintaining identity fidelity while improving quality. Comprehensive experiments and ablations demonstrate significant gains over prior DPO methods and baselines, with practical implications for creative and media applications, albeit with noted limitations such as finger realism and broader societal considerations.

Abstract

Human image generation is a key focus in image synthesis due to its broad applications, but even slight inaccuracies in anatomy, pose, or details can compromise realism. To address these challenges, we explore Direct Preference Optimization (DPO), which trains models to generate preferred (winning) images while diverging from non-preferred (losing) ones. However, conventional DPO methods use generated images as winning images, which limits realism. To overcome this limitation, we propose an enhanced DPO approach that incorporates high-quality real images as winning images, encouraging outputs to resemble real images rather than generated ones. Because implementing this idea is not trivial, our approach, HG-DPO (Human image Generation through DPO), employs a novel curriculum learning framework that gradually improves the output of the model toward greater realism, making training more feasible. Furthermore, HG-DPO adapts effectively to personalized text-to-image tasks, generating high-quality, identity-specific images, which highlights the practical value of our approach.
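
To make the winning/losing idea concrete, here is a minimal sketch of the generic DPO objective for a single preference pair. It is an illustration under stated assumptions, not the HG-DPO training loss: the function name `dpo_loss`, the scalar log-likelihood inputs, and the default `beta` are hypothetical, and a diffusion model would typically substitute a denoising-error proxy for the log-likelihoods.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (winning, losing) pair.

    logp_w / logp_l: log-likelihoods of the winning and losing samples
        under the model being trained (hypothetical scalar inputs).
    ref_logp_w / ref_logp_l: the same quantities under a frozen
        reference model.
    beta: strength of the implicit regularization toward the reference
        (illustrative default, not a value from the paper).
    """
    # How much more the trained model favors the winner over the loser,
    # relative to the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    x = beta * margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# With no preference margin the loss equals log(2); it shrinks as the
# model favors the winning sample more than the reference does.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In the setting the abstract describes, the winning sample would be a real image rather than a generated one; the sketch leaves the source of each sample outside the loss function.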

Paper Structure (69 sections, 11 equations, 25 figures, 9 tables)

Figures (25)

  • Figure 1: Top: HG-DPO generates high-quality human images that encompass a wide range of actions, appearances, group sizes, and backgrounds. Bottom left: This is because HG-DPO improves the base model to generate images with more realistic anatomical features and poses, while also better aligning with the prompt (red text in the prompt). Bottom right: The benefits of HG-DPO transfer to personalized text-to-image tasks without additional training, generating high-quality images with the identity of the concept image.
  • Figure 2: Three-stage training of HG-DPO. It progressively enhances the model's human image generation capabilities.
  • Figure 3: DPO Dataset for the easy stage. In the upper figure, $\mathcal{D}_\mathbb{E}$, constructed with AI rather than human feedback, shows winning images with superior features over losing images. A user study in the lower figure confirms this outcome.
  • Figure 4: Qualitative comparison with the previous methods. HG-DPO generates high-quality human images with more realistic compositions and poses, providing superior text alignment compared to the prior methods.
  • Figure 5: Qualitative progress. $\epsilon_{base}$ evolves as it progresses through each stage of the HG-DPO pipeline up to the hard stage.
  • ...and 20 more figures