Detecting Human Artifacts from Text-to-Image Models

Kaihong Wang, Lingzhi Zhang, Jianming Zhang

TL;DR

This work tackles human-related artifacts in text-to-image generation by creating the Human Artifact Dataset (HAD) and training specialized detectors (HADM) for local and global artifacts. The authors demonstrate that HADM generalizes across unseen generators and can guide diffusion model finetuning and automated inpainting to reduce artifacts, improving human structural coherence. They further validate the approach with extensive experiments, ablations, and a user study, and release both the dataset and models for broad use. The work offers a practical feedback loop to enhance synthetic image quality and provides robust benchmarks for evaluating human artifacts across diverse T2I models.

Abstract

Despite recent advancements, text-to-image generation models often produce images containing artifacts, especially in human figures. These artifacts appear as poorly generated human bodies, including distorted, missing, or extra body parts, leading to visual inconsistencies with typical human anatomy and greatly impairing overall fidelity. In this study, we address this challenge by curating the Human Artifact Dataset (HAD), a diverse dataset specifically designed to localize human artifacts. HAD comprises over 37,000 images generated by several popular text-to-image models, annotated for human artifact localization. Using this dataset, we train the Human Artifact Detection Models (HADM), which can identify different artifacts across multiple generative domains and demonstrate strong generalization, even on images from unseen generators. Additionally, to further improve generators' perception of human structural coherence, we use the predictions from our HADM as feedback for diffusion model finetuning. Our experiments confirm a reduction in human artifacts in the resulting model. Furthermore, we showcase a novel application of our HADM in an iterative inpainting framework to correct human artifacts in arbitrary images directly, demonstrating its utility in improving image quality. Our dataset and detection models are available at: https://github.com/wangkaihong/HADM.
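The abstract does not spell out how the iterative inpainting framework is wired together, so the following is only a minimal sketch of one plausible detect-then-inpaint loop, not the authors' implementation. The names `Detection`, `iterative_repair`, `detect`, `inpaint`, `score_threshold`, and `max_rounds` are all hypothetical and are not taken from the HADM repository; the detector and inpainter are passed in as plain callables so the control flow can be read on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; an assumed convention


@dataclass
class Detection:
    """One predicted artifact region (hypothetical structure)."""
    box: Box
    label: str    # e.g. "hand" or "extra limb"
    score: float  # detector confidence in [0, 1]


def iterative_repair(
    image: Any,
    detect: Callable[[Any], List[Detection]],
    inpaint: Callable[[Any, Box], Any],
    score_threshold: float = 0.5,  # assumed value, not from the paper
    max_rounds: int = 3,           # assumed value, not from the paper
) -> Tuple[Any, List[int]]:
    """Alternate detection and inpainting until nothing is flagged.

    Returns the final image and, for each round, the number of
    detections at or above the threshold.
    """
    history: List[int] = []
    for _ in range(max_rounds):
        flagged = [d for d in detect(image) if d.score >= score_threshold]
        history.append(len(flagged))
        if not flagged:
            break  # the detector reports a clean image
        for d in flagged:
            # Regenerate only the flagged region; the rest of the image is kept.
            image = inpaint(image, d.box)
    return image, history
```

In practice `detect` would wrap a trained artifact detector and `inpaint` a diffusion inpainting model; the round limit guards against a region that the inpainter never manages to fix.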

Paper Structure

This paper contains 31 sections, 6 equations, 23 figures, 7 tables.

Figures (23)

  • Figure 1: Comparison between our Human Artifact Detection Models (HADM) and state-of-the-art vision-language models in detecting human artifacts in an influential deepfake image circulating on social media during the 2024 U.S. presidential election. While advanced VL models fail to detect visible artifacts of the right human figure in the image (shown in the responses at the bottom), our models successfully identify and localize the distorted hands and the extra limb (top right). Source: https://farid.berkeley.edu/deepfakes2024election/.
  • Figure 2: Example annotations from different generators in Human Artifact Dataset.
  • Figure 3: Distribution of local (left) and global (right) artifacts by category across four image generators in our Human Artifact Dataset. Several key observations emerge. Local artifacts are significantly more common than global artifacts across all generators, particularly in the hands. Generator-wise, SDXL produces the highest number of global artifacts, while DALLE-2 produces the highest number of local artifacts. Both DALLE-3 and Midjourney exhibit stronger human structural coherence, with fewer overall artifacts. DALLE-3 shows a slight advantage in avoiding global artifacts, whereas Midjourney performs marginally better in avoiding local artifacts.
  • Figure 4: Comparison of the AUC scores of HADM against baseline methods. ICL represents in-context learning.
  • Figure 5: Examples of predictions from our HADM considered mistakes during evaluation on SDXL (a), DALLE-3 (b), DALLE-2 (c), and Midjourney (d). FP: false positive, FN: false negative. Red bounding boxes mark the detected artifacts with the top prediction scores; blue bounding boxes mark other detections with the same class label.
  • ...and 18 more figures