HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zutao Jiang, Hang Xu, Shengcai Liao, Xiaodan Liang
TL;DR
This work addresses the persistent problem of distorted limbs in text-to-image diffusion by introducing AbHuman, the first large-scale benchmark for anatomical abnormalities in synthesized humans, and by proposing HumanRefiner, a plug-and-play coarse-to-fine refinement pipeline built on the anomaly recognition that AbHuman makes possible. AbHuman provides bounding-box-level annotations for 18 anomaly categories across 56K images, enabling both abnormal scoring and detector-based refinement. HumanRefiner combines abnormal guidance, negative prompting, and pose-reversible mechanisms, achieving a 2.9× gain in limb quality over SDXL and a 1.4× gain over DALL-E 3 in human evaluations. The approach enables anomaly-aware generation, with practical implications for safer and more reliable human image synthesis, and points to faster inference and richer scene annotations as directions for future work.
Abstract
Text-to-image diffusion models have significantly advanced conditional image generation. However, these models often struggle to render humans accurately, producing distorted limbs and other anomalies. This issue stems primarily from insufficient recognition and evaluation of limb quality in diffusion models. To address it, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. The benchmark consists of 56K synthesized human images annotated with detailed, bounding-box-level labels that together identify 147K human anomalies in 18 categories. With these annotations, human anomalies can be recognized, which in turn improves image generation through traditional techniques such as negative prompting and guidance. To improve results further, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner uses a self-diagnostic procedure to detect and correct issues related to both coarse-grained abnormal human poses and fine-grained anomaly levels, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9× improvement in limb quality over the state-of-the-art open-source generator SDXL and a 1.4× improvement over DALL-E 3 in human evaluations. Our data and code are available at https://github.com/Enderfga/HumanRefiner.
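The abstract mentions negative prompting and guidance as the traditional techniques that anomaly recognition can feed. As a minimal sketch of how detected anomalies could be turned into a negative prompt and used in the standard classifier-free guidance update, consider the following. It is not the HumanRefiner implementation: the function names, the example anomaly labels, and the use of plain Python lists in place of noise-prediction tensors are all assumptions made for illustration.

```python
from typing import List, Sequence


def build_negative_prompt(detected_labels: Sequence[str]) -> str:
    """Join detected anomaly labels into one negative prompt.

    The labels are hypothetical placeholders; the 18 AbHuman category
    names are not listed in the abstract.
    """
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_labels = dict.fromkeys(detected_labels)
    return ", ".join(unique_labels)


def guided_noise(
    eps_cond: Sequence[float],
    eps_neg: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard classifier-free guidance with a negative prompt.

    eps_cond: noise predicted for the user's prompt.
    eps_neg:  noise predicted for the negative prompt (it takes the
              place of the usual empty-prompt prediction).
    scale:    guidance scale; larger values push the sample further
              away from the negative prompt.
    """
    if len(eps_cond) != len(eps_neg):
        raise ValueError("noise predictions must have the same length")
    # eps = eps_neg + scale * (eps_cond - eps_neg), applied elementwise.
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


if __name__ == "__main__":
    labels = ["extra hand", "extra hand", "distorted leg"]  # hypothetical detector output
    print(build_negative_prompt(labels))
    print(guided_noise([1.0, 2.0], [0.0, 1.0], scale=2.0))
```

With a scale of 1 the update reduces to the prompt-conditioned prediction, and with a scale of 0 it reduces to the negative-prompt prediction, which makes the role of the guidance scale easy to check on toy inputs.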
