ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs

James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy, Kuan-Chieh Jackson Wang

TL;DR

ArtifactLens demonstrates that pretrained vision-language models can effectively detect artifacts in AI-generated images when augmented with a data-efficient scaffolding framework. By decomposing the task into specialized subproblems, cropping to informative regions, and optimizing prompts and demonstrations through in-context learning and full-spectrum prompting, the approach achieves state-of-the-art performance across five human-artifact benchmarks with only a few hundred labeled examples per artifact category. Key contributions include counterfactual demonstrations for ICL and full-spectrum prompting for text optimization, plus a multi-component architecture that preserves data efficiency while generalizing to non-human artifacts and AIGC detection. The findings suggest a practical path to scalable artifact detection that reduces labeling burdens and can adapt to evolving artifact taxonomies, with implications for benchmarking, reward modeling, and AI-regulation workflows.

Abstract

Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.

Paper Structure (31 sections, 10 figures, 7 tables)

Figures (10)

  • Figure 1: The scaffolding methods in ArtifactLens. Left: a multi-component architecture, where each specialist leverages pretrained VLMs to classify a single error like 'leg artifact'. The specialists use a crop tool to zoom in on regions of interest for easier visual understanding [wu2024v]. Middle: To optimize the pretrained VLMs, in-context learning uses prompts with task demonstrations, which are image-label pairs. The challenge is choosing the best task demonstrations; our final system uses retrieval-based selection. Right: We also optimize the pretrained VLM with text prompt optimization. A concise seed instruction is passed to an LLM, which generates candidate text prompts. The prompts are evaluated against a development dataset and the results are fed back to an LLM for rewriting.
  • Figure 2: Counterfactual demonstrations for in-context learning (ICL) (see the ICL results section): typical ICL methods may not consider the relationship between different demonstrations. We choose demonstrations in pairs where the images are semantically similar but have opposite artifact labels, which defines the learning task more clearly.
  • Figure 3: In text optimization, LLMs map a task description (yellow) to candidate text prompts (clear boxes). We add hints (blue) that cover the 'full spectrum' of confidence thresholds. Without these hints, most generated prompts are cautious (instructing the VLM to flag errors only if confident), which leads to worse artifact detection performance.
  • Figure 4: Performance of ArtifactLens on other (non-human-artifact) tasks, showing the generality of the methods (see the results section on other detection tasks).
  • Figure 6: 'Controversial' images in the human study. When classifying for 'hand artifact', five annotators said yes and five said no.
  • ...and 5 more figures
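
The counterfactual pairing described in the Figure 2 caption (semantically similar images with opposite artifact labels) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the captions do not specify the similarity measure or the matching procedure, so this sketch uses cosine similarity over precomputed image embeddings and a greedy one-to-one matching. The names `cosine` and `counterfactual_pairs` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def counterfactual_pairs(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> List[Tuple[int, int]]:
    """Pick up to k (artifact, clean) index pairs with the most similar embeddings.

    Assumptions (not from the paper): labels are 1 for 'artifact' and 0 for
    'no artifact'; embeddings come from some image encoder; matching is
    greedy, and each image is used in at most one pair.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]

    # Score every opposite-label pair, most similar first.
    scored = sorted(
        ((cosine(embeddings[p], embeddings[n]), p, n)
         for p in positives for n in negatives),
        reverse=True,
    )

    used = set()
    pairs: List[Tuple[int, int]] = []
    for _sim, p, n in scored:
        if len(pairs) == k:
            break
        if p in used or n in used:
            continue  # keep the matching one-to-one
        pairs.append((p, n))
        used.update((p, n))
    return pairs
```

Each returned pair would then be placed side by side in the in-context prompt, so that the two demonstrations differ mainly in the presence of the artifact. An exhaustive pairwise scan like this is only practical for small labeled pools, which fits the few-hundred-example regime the abstract describes.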