HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images

Zichuan Wang, Bo Peng, Songlin Yang, Zhenchen Tang, Jing Dong

TL;DR

The paper addresses the lack of quality evaluation for hand regions in generated images, a detail that strongly affects perceived realism and AIGC detection. It introduces HandPair, a dataset of 48k images organized as pairs of high-quality hands and low-quality counterparts, and HandEval, a hand-specific quality evaluator that fuses the visual features of an MLLM with hand keypoint priors. HandEval aligns better with human judgments than existing methods and improves both hand generation (via HandEval-guided optimization) and AIGC detection (via a hand-quality fusion module) across multiple models and detectors. The authors state that code and dataset will be released. The approach extends localized IQA by integrating structural hand priors into multimodal evaluation, supporting hand-aware generation and forgery detection.

Abstract

Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle to generate accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting the performance of downstream tasks such as human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of a Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, notably enhancing generated hand realism and detection accuracy, respectively, confirming its effectiveness in downstream applications. Code and dataset will be available.
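
To illustrate how high- and low-quality pairs can supervise a scorer without manual labels, here is a minimal sketch of a pairwise margin ranking loss. This is an assumption for illustration only: the abstract does not state which training objective HandEval uses, and the names `pairwise_ranking_loss`, `batch_ranking_loss`, and the `margin` value are hypothetical.

```python
def pairwise_ranking_loss(score_high, score_low, margin=0.5):
    """Hinge-style ranking loss for one (high-quality, low-quality) pair.

    The loss is zero once the high-quality image outscores its
    low-quality counterpart by at least `margin` (hypothetical value).
    """
    return max(0.0, margin - (score_high - score_low))


def batch_ranking_loss(score_pairs, margin=0.5):
    """Mean ranking loss over a list of (score_high, score_low) tuples."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    total = sum(pairwise_ranking_loss(h, l, margin) for h, l in score_pairs)
    return total / len(score_pairs)


# Toy scores standing in for model outputs on three image pairs.
example_pairs = [
    (0.9, 0.2),  # correctly ordered with a wide gap: no penalty
    (0.6, 0.5),  # correctly ordered but gap below the margin: small penalty
    (0.3, 0.7),  # wrongly ordered: largest penalty
]
example_loss = batch_ranking_loss(example_pairs)
```

The only supervision needed is the ordering inside each pair, which is known by construction, so no per-image human rating is required.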

Paper Structure

This paper contains 34 sections, 11 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of hand quality assessment and its importance for downstream tasks. Hand quality assessment is crucial for ensuring realism in generating human-centric content and detecting AI-generated content (AIGC) where hand artifacts often reveal forgeries. While existing Image Quality Assessment (IQA) methods focus solely on global image quality, they neglect critical hand-specific details and struggle to evaluate hand quality, limiting their application in downstream tasks. Therefore, we propose the first systematic approach to fill the gap of hand-specific quality assessment, achieving high consistency with human ratings as well as improving generation quality and AIGC detection performance.
  • Figure 2: Example image pairs from the HandPair dataset. Each pair consists of a high-quality hand image and its corresponding low-quality version generated using hand-specific inpainting methods. The low-quality images exhibit typical visual defects in generated hands, which are categorized into six types: structural missing, structural redundancy, proportional distortion, structural deformation, structural fusion and unrealistic texture.
  • Figure 3: Distribution of finger flexion angles computed from PIP joints. A smaller angle indicates a more bent finger. A higher proportion of large-angle samples (150° and above) is observed, which can be attributed to frequent hand-object interaction scenarios where fingers tend to be more extended in our dataset. Overall, the distributions confirm diverse finger poses across the dataset, with the thumb exhibiting consistently larger angles due to its anatomical characteristics.
  • Figure 4: Palm orientation distribution (left) and proportion of each hand image defect type (right). The palm orientation distribution shows a relatively uniform angular coverage from 0° to 180°, indicating diverse viewpoints without significant bias. The sample counts across six defect categories are relatively balanced, ensuring diversity and representativeness of the generated low-quality hand images.
  • Figure 5: The overall architecture of HandEval. It incorporates hand keypoint priors to enhance quality assessment. Hand images are processed through a visual encoder, while structural priors in the form of hand keypoints are encoded via a GCN-based keypoint encoder to capture spatial and structural information. These visual and keypoint features are then fused through a cross-attention mechanism. Finally, guided by a text-based prompt, a pretrained language model performs quality assessment and outputs the final HandScore.
  • ...and 7 more figures
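
The flexion angle described in the Figure 3 caption can be sketched as the angle at the PIP joint between the segments pointing to the two neighboring joints. This is a minimal illustration, not the authors' code: the caption does not give the exact formula, so the choice of neighboring keypoints (MCP and DIP), the function names, and the toy coordinates are assumptions.

```python
import math


def joint_angle_deg(prev_joint, joint, next_joint):
    """Angle in degrees at `joint` between the segments to its two neighbors.

    Points are (x, y) or (x, y, z) tuples. A straight finger gives about
    180 degrees; a smaller value means a more bent finger.
    """
    v1 = [p - q for p, q in zip(prev_joint, joint)]
    v2 = [p - q for p, q in zip(next_joint, joint)]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("neighboring keypoints must differ from the joint")
    cos_angle = sum(a * b for a, b in zip(v1, v2)) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def share_at_or_above(angles, threshold_deg=150.0):
    """Fraction of angles at or above the threshold (150 as in the caption)."""
    if not angles:
        raise ValueError("angles must not be empty")
    return sum(a >= threshold_deg for a in angles) / len(angles)


# Toy keypoints (assumed layout: MCP, PIP, DIP of one finger).
straight = joint_angle_deg((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
bent = joint_angle_deg((0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
extended_share = share_at_or_above([straight, bent])
```

With these toy points the straight finger measures 180° and the bent one 90°, matching the caption's convention that smaller angles correspond to more flexion.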