ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs
James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy, Kuan-Chieh Jackson Wang
TL;DR
ArtifactLens shows that pretrained vision-language models (VLMs) can detect artifacts in AI-generated images when paired with a data-efficient scaffolding framework. The framework decomposes the task into specialized subproblems, crops to informative regions, and optimizes both the in-context demonstrations and the text instructions. With only a few hundred labeled examples per artifact category, it achieves state-of-the-art performance on five human-artifact benchmarks. Its two novel components are counterfactual demonstrations for in-context learning (ICL) and full-spectrum prompting for text-instruction optimization. The approach also generalizes to non-human artifacts and to AIGC detection. The findings suggest a practical path to scalable artifact detection that reduces the labeling burden and adapts to evolving artifact taxonomies, with implications for benchmarking, reward modeling, and AI-regulation workflows.
Abstract
Modern image generators produce strikingly realistic images in which only artifacts such as distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, which is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts: with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art results on five human-artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types (object morphology, animal anatomy, and entity interactions) and to the distinct task of AIGC detection.
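As a rough illustration of the scaffolding idea, the sketch below shows one way a multi-component detector could be organized: each specialist owns a text instruction, a cropping function, and a small set of labeled demonstrations, and queries a shared VLM on its own crop. This is a minimal sketch under stated assumptions, not the ArtifactLens implementation. The names `Specialist`, `detect_artifacts`, `has_artifact`, and `toy_vlm` are hypothetical, the dictionary-based "image" is a toy stand-in, and flagging an image when any specialist fires is an assumed aggregation rule that the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical types for illustration only.
Image = Any                      # toy stand-in for real image data
Demo = Tuple[Image, bool]        # (example crop, "has artifact" label)
VLM = Callable[[str, List[Demo], Image], bool]


@dataclass
class Specialist:
    """One subproblem: an instruction, a crop, and in-context demonstrations."""
    name: str
    prompt: str
    crop: Callable[[Image], Image]
    demos: List[Demo] = field(default_factory=list)

    def detect(self, vlm: VLM, image: Image) -> bool:
        # Query the VLM on the informative region only.
        return vlm(self.prompt, self.demos, self.crop(image))


def detect_artifacts(vlm: VLM, specialists: List[Specialist], image: Image) -> Dict[str, bool]:
    """Run every specialist and collect one decision per subproblem."""
    return {s.name: s.detect(vlm, image) for s in specialists}


def has_artifact(votes: Dict[str, bool]) -> bool:
    # Assumed aggregation: the image is flagged if any specialist fires.
    return any(votes.values())


def toy_vlm(prompt: str, demos: List[Demo], crop: Image) -> bool:
    # Stub in place of a real VLM call: reads a flag stored in the toy crop.
    return bool(crop.get("distorted", False))


# Toy "image": a dict of regions, each carrying a ground-truth flag.
toy_image = {"hands": {"distorted": True}, "face": {"distorted": False}}

specialists = [
    Specialist("hands", "Is the hand anatomically plausible?", lambda img: img["hands"]),
    Specialist("face", "Is the face anatomically plausible?", lambda img: img["face"]),
]

votes = detect_artifacts(toy_vlm, specialists, toy_image)
flagged = has_artifact(votes)
```

In a real system the stub would be replaced by a call to a pretrained VLM, and the prompt and demonstrations of each specialist would be the quantities tuned on the few hundred labeled examples.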
