Evaluating and Predicting Distorted Human Body Parts for Generated Images

Lu Ma, Kaibo Cao, Hao Liang, Jiaxin Lin, Zhuang Li, Yuhong Liu, Jihong Zhang, Wentao Zhang, Bin Cui

TL;DR

This work tackles the pervasive issue of anatomical inaccuracy in AI-generated human images by introducing Distortion-5K, a curated dataset of ~4.7k images with annotated distorted regions, and ViT-HD, a Vision Transformer-based detector trained in a two-stage regime to localize distortions across body parts. It also presents the Human Distortion Benchmark, a 500-prompt evaluation suite used to assess four popular T2I models, revealing that nearly half of generated images contain distortions. The authors compare ViT-HD against multiple baselines (segmentation models, ViTs, and VLMs), demonstrating superior pixel-level distortion localization with $F1=0.899$ and $IoU=0.831$, and highlight the limitations of current models in fully guaranteeing anatomical fidelity. The work provides a practical toolkit for improving human figure generation fidelity and sets the stage for safer, more realistic AI-driven imagery, with resources and code to be released on GitHub.

Abstract

Recent advancements in text-to-image (T2I) models enable high-quality image synthesis, yet generating anatomically accurate human figures remains challenging. AI-generated images frequently exhibit distortions such as proliferated limbs, missing fingers, deformed extremities, or fused body parts. Existing evaluation metrics like Inception Score (IS) and Fréchet Inception Distance (FID) lack the granularity to detect these distortions, while human preference-based metrics focus on abstract quality assessments rather than anatomical fidelity. To address this gap, we establish the first standards for identifying human body distortions in AI-generated images and introduce Distortion-5K, a comprehensive dataset comprising 4,700 annotated images of normal and malformed human figures across diverse styles and distortion types. Based on this dataset, we propose ViT-HD, a Vision Transformer-based model tailored for detecting human body distortions in AI-generated images, which outperforms state-of-the-art segmentation models and visual language models, achieving an F1 score of 0.899 and IoU of 0.831 on distortion localization. Additionally, we construct the Human Distortion Benchmark with 500 human-centric prompts to evaluate four popular T2I models using the trained ViT-HD, revealing that nearly 50% of generated images contain distortions. This work pioneers a systematic approach to evaluating anatomical accuracy in AI-generated humans, offering tools to advance the fidelity of T2I models and their real-world applicability. The Distortion-5K dataset and the trained ViT-HD model will soon be released in our GitHub repository: https://github.com/TheRoadQaQ/Predicting-Distortion.
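
For readers unfamiliar with the two localization metrics quoted above, the following is a minimal sketch of how pixel-level IoU and F1 are commonly computed for one predicted distortion mask against one annotated mask. The function name `mask_iou_f1`, the handling of two empty masks, and the toy arrays are assumptions made for illustration. The paper's exact evaluation protocol (for example, per-image averaging versus dataset-level pixel counts) is not specified in this summary and may differ.

```python
import numpy as np


def mask_iou_f1(pred, target):
    """Pixel-level IoU and F1 between two binary masks (hypothetical helper).

    Nonzero pixels are treated as "distorted". Both inputs must have the
    same shape.
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    tp = int(np.logical_and(pred, target).sum())    # distorted in both masks
    fp = int(np.logical_and(pred, ~target).sum())   # predicted only
    fn = int(np.logical_and(~pred, target).sum())   # annotated only

    if tp + fp + fn == 0:
        # Assumption: two empty masks (no distortion anywhere) count as
        # perfect agreement. Other conventions exist.
        return 1.0, 1.0

    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy 2x3 masks: 2 overlapping pixels, 1 false positive, 1 false negative.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
iou, f1 = mask_iou_f1(pred_mask, true_mask)
print(f"IoU={iou:.3f}, F1={f1:.3f}")
```

On a single pair of masks the two quantities are tied by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU. Scores averaged over many images need not satisfy this identity exactly.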

Paper Structure

This paper contains 31 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Examples from our Distortion-5K. AI-generated human images frequently exhibit various distortions, including proliferation (first row), absence (second row), deformation (third row), fusion (fourth row), and the occurrence of multiple distortions within a single image (fifth row). We annotate the distorted body parts in these images, where the left image in each pair represents the original, and the right image features red masks indicating the distorted regions.
  • Figure 2: Analysis of our Distortion-5K dataset. Left: The rate of distorted human images. Middle: The distribution of distortion types. Right: The frequency of the relative distorted areas.
  • Figure 3: Examples of test images along with predicted distortion masks. Our ViT-HD predicts the distorted body parts in these images, where the left image in each pair represents the original, and the right image features red masks indicating the predicted distorted regions.
  • Figure 4: Analysis of our Human Distortion Benchmark. Left: Distribution of word counts. Right: Word cloud.
  • Figure 5: Rate of undistorted AI-generated images on Human Distortion Benchmark.
  • ...and 2 more figures