HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zutao Jiang, Hang Xu, Shengcai Liao, Xiaodan Liang

TL;DR

This work addresses the persistent problem of distorted limbs in text-to-image diffusion by introducing AbHuman, the first large-scale benchmark for anatomical abnormalities in synthesized humans, and by proposing HumanRefiner, a plug-and-play coarse-to-fine refinement pipeline that leverages AbHuman signals. AbHuman provides bounding-box level annotations for 18 anomaly categories across 56K images, enabling both abnormal scoring and detector-based refinement. HumanRefiner combines abnormal guidance, negative prompting, and pose-reversible mechanisms to achieve substantial improvements, including a 2.9× gain in limb quality over SDXL and 1.4× over DALL-E 3 in human evaluations. The approach enables anomaly-aware generation with practical implications for safer and more reliable human image synthesis, while also highlighting avenues for faster inference and richer scene annotations in future work.

Abstract

Text-to-image diffusion models have significantly advanced conditional image generation. However, these models often struggle to render humans accurately, producing distorted limbs and other anomalies. This issue primarily stems from insufficient recognition and evaluation of limb quality in diffusion models. To address it, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. The benchmark consists of 56K synthesized human images, each annotated with detailed, bounding-box level labels that together identify 147K human anomalies in 18 different categories. With these annotations, models that recognize human anomalies can be trained, which in turn improves image generation through traditional techniques such as negative prompting and guidance. To improve results further, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner uses a self-diagnostic procedure to detect and correct both coarse-grained abnormal human poses and fine-grained anomalies, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9× improvement in limb quality compared to the state-of-the-art open-source generator SDXL and a 1.4× improvement over DALL-E 3 in human evaluations. Our data and code are available at https://github.com/Enderfga/HumanRefiner.
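
To make the annotation format concrete, the following is a minimal sketch of what a bounding-box level anomaly label might look like and how per-image anomaly counts could be tallied. The field names, category strings, and image IDs are hypothetical placeholders, not the actual AbHuman schema; only the rounded totals (56K images, 147K anomalies, 18 categories) come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyBox:
    """One bounding-box anomaly label (hypothetical schema, not AbHuman's)."""
    image_id: str
    category: str   # would be one of the 18 categories; names here are made up
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp at zero so a malformed box never yields a negative area.
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def anomalies_per_image(boxes):
    """Count how many anomaly boxes each image carries."""
    return Counter(box.image_id for box in boxes)


# Toy records for illustration only.
boxes = [
    AnomalyBox("img_0001", "abnormal_hand", 10, 20, 60, 90),
    AnomalyBox("img_0001", "abnormal_leg", 100, 40, 160, 200),
    AnomalyBox("img_0002", "abnormal_hand", 5, 5, 25, 45),
]
counts = anomalies_per_image(boxes)

# Average implied by the rounded totals reported in the abstract.
mean_reported = 147_000 / 56_000
```

Using the rounded figures from the abstract, the dataset averages roughly 2.6 annotated anomalies per image; the exact value depends on the unrounded counts.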

Paper Structure (23 sections, 9 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Human images synthesized by the SOTA text-to-image diffusion model SDXL still exhibit incorrect limb counts, twisted hands, and incorrectly positioned limbs. We propose HumanRefiner, a coarse-to-fine, self-diagnosing, pose/anomaly-reversible generation pipeline built on the AbHuman benchmark, to eliminate these anomalies.
  • Figure 2: Data generation pipeline of our AbHuman dataset.
  • Figure 3: AbHuman annotation statistics.
  • Figure 4: Visualization of abnormal scores. Red indicates images with high abnormal scores, while green indicates images with low abnormal scores.
  • Figure 5: The visualization of limb detection in the test set using the fine-tuned AbHuman detection model. Red boxes and labels are used to annotate the abnormal limbs detected, and white boxes and labels are used to annotate the normal limbs detected.
  • ...and 5 more figures