HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images
Zichuan Wang, Bo Peng, Songlin Yang, Zhenchen Tang, Jing Dong
TL;DR
The paper addresses the lack of quality evaluation for hand regions in generated images, a local detail that strongly affects realism and AIGC detection. It introduces HandPair, a 48k-image dataset that pairs high-quality real hands with degraded low-quality counterparts, and HandEval, a hand-focused quality evaluator that fuses the visual understanding of an MLLM with hand keypoint priors. HandEval aligns better with human judgments than existing methods and improves both hand generation (via HandEval-guided optimization) and AIGC detection (via a hand-quality fusion module) across multiple models and detectors. The work offers practical tools for improving local hand fidelity in generation and detection pipelines, and the authors plan to release the code and dataset. By integrating structural hand priors into multimodal evaluation, the approach extends localized image quality assessment (IQA) toward more reliable hand-aware generation and forgery detection.
Abstract
Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle to generate accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be highly noticeable even when the rest of the body is well generated. However, quality assessment of hand regions remains largely neglected, which limits the performance of downstream tasks such as human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed from high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Building on this dataset, we develop HandEval, a hand-specific quality assessment model. It leverages the visual understanding capability of a Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, giving it a strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, where it improves generated hand realism and detection accuracy, respectively, confirming its effectiveness in downstream applications. The code and dataset will be made available.
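
As an illustration of how high- and low-quality pairs can supervise a quality scorer without manual labels, the sketch below uses a pairwise logistic ranking loss. The abstract does not state HandEval's training objective, so this loss, the function names, and the toy scores are all assumptions made for illustration, not the authors' method.

```python
import math


def pairwise_ranking_loss(score_high: float, score_low: float) -> float:
    """Logistic ranking loss, log(1 + exp(-(score_high - score_low))).

    Hypothetical objective (not taken from the paper): the loss is small
    when the high-quality hand is scored well above its low-quality
    counterpart, and large when the order is reversed.
    """
    diff = score_high - score_low
    # Numerically stable softplus(-diff).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def pair_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (high, low) score pairs that are ranked correctly."""
    correct = sum(1 for high, low in score_pairs if high > low)
    return correct / len(score_pairs)


# Toy scores standing in for a model's outputs on (high, low) hand pairs.
toy_pairs = [(0.9, 0.2), (0.7, 0.4), (0.3, 0.6), (0.8, 0.1)]
mean_loss = sum(pairwise_ranking_loss(h, l) for h, l in toy_pairs) / len(toy_pairs)
accuracy = pair_accuracy(toy_pairs)
```

Because each pair carries its own ordering, the supervision signal comes from the pairing itself rather than from per-image human ratings, which is the property the abstract attributes to HandPair.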
