Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection

Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, Shouhong Ding

TL;DR

This work addresses the fragility of AI-generated image detectors, especially VLM-based systems, by diagnosing a task-model misalignment between semantic understanding and pixel forensics. It proposes the Task-Model Alignment principle and a dual-branch detector, AlignGemini, which pairs a semantically tuned VLM with a pixel-artifact expert trained on orthogonal data and fuses the two via an OR rule. Empirical results across in-the-wild benchmarks, a new AIGI-Now suite, and self-synthesized tests show improved generalization, robustness, and extensibility, with AlignGemini achieving notable gains over baselines while using a simpler training corpus. The authors also release AIGI-Now to better reflect real-world deployment scenarios and emphasize the practical potential of task-pure supervision for reliable AIGI detection.

Abstract

Vision Language Models (VLMs) are increasingly adopted for AI-generated image (AIGI) detection, yet converting VLMs into detectors requires substantial resources, and the resulting models still exhibit severe hallucinations. To probe the core issue, we conduct an empirical analysis and observe two characteristic behaviors: (i) fine-tuning VLMs on high-level semantic supervision strengthens semantic discrimination and generalizes well to unseen data; (ii) fine-tuning VLMs on low-level pixel-artifact supervision yields poor transfer. We attribute VLMs' underperformance to task-model misalignment: semantics-oriented VLMs inherently lack sensitivity to fine-grained pixel artifacts, and semantically non-discriminative pixel artifacts thus exceed their inductive biases. In contrast, we observe that conventional pixel-artifact detectors capture low-level pixel artifacts yet exhibit limited semantic awareness relative to VLMs, highlighting that distinct models are better matched to distinct tasks. In this paper, we formalize AIGI detection as two complementary tasks--semantic consistency checking and pixel-artifact detection--and show that neglecting either induces systematic blind spots. Guided by this view, we introduce the Task-Model Alignment principle and instantiate it as a two-branch detector, AlignGemini, comprising a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision. By enforcing orthogonal supervision on two simplified datasets, each branch trains to its strengths, producing complementary discrimination over semantic and pixel cues. On five in-the-wild benchmarks, AlignGemini delivers a +9.5 gain in average accuracy, supporting task-model alignment as an effective path to generalizable AIGI detection.
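
The OR-rule fusion of the two branches mentioned in the TL;DR can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `BranchScores` container, the `fuse_or` function, the assumption that each branch emits a score in [0, 1], and the 0.5 thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BranchScores:
    """Hypothetical per-image scores in [0, 1]; higher means more likely AI-generated."""

    semantic: float  # assumed output of the semantically fine-tuned VLM branch
    pixel: float     # assumed output of the pixel-artifact expert branch


def fuse_or(scores: BranchScores,
            semantic_thr: float = 0.5,
            pixel_thr: float = 0.5) -> bool:
    """Flag an image as AI-generated if either branch flags it (OR rule).

    The thresholds are illustrative assumptions, not values from the paper.
    """
    semantic_flag = scores.semantic >= semantic_thr
    pixel_flag = scores.pixel >= pixel_thr
    return semantic_flag or pixel_flag


if __name__ == "__main__":
    # Semantically faithful fake with clear pixel artifacts: only the expert fires.
    print(fuse_or(BranchScores(semantic=0.1, pixel=0.9)))
    # Heavily degraded but implausible fake: only the VLM fires.
    print(fuse_or(BranchScores(semantic=0.8, pixel=0.2)))
    # Neither branch fires: treated as real.
    print(fuse_or(BranchScores(semantic=0.1, pixel=0.2)))
```

Under an OR rule, an image escapes detection only if it passes both branches, which is why each branch can cover the other's blind spot. The trade-off is that false positives from either branch carry through to the fused decision.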

Paper Structure

This paper contains 27 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the blind spots of semantic and pixel-artifact detectors. Top: semantically faithful synthetic images evade semantic detectors despite clear pixel artifacts. Bottom: heavy degradations (e.g., compression, resizing) destroy the pixel traces left by generation, reducing the detection rate of pixel-artifact detectors by 16% or more, even for evidently implausible synthetic images.
  • Figure 2: Performance of a conventional vision backbone (DINOv2) and a VLM (Qwen2.5-VL-7B) under semantic vs. pixel-artifact supervision. “Semantic/Pixel Val” denotes synthetic images generated by the same generator used in training, while “Semantic/Pixel Test” denotes synthetic images from unseen generators. Top: DINOv2 trained with semantic supervision generalizes poorly on the semantic test set, whereas DINOv2 trained with pixel-artifact supervision generalizes well on the pixel test set. Bottom: Qwen2.5-VL trained with semantic supervision reliably identifies semantic flaws from unseen generators, while pixel-artifact supervision fails to enhance its pixel-artifact detection. Rightmost column: mixed semantic-pixel supervision undermines each model’s inherent strengths. Together, these results indicate a clear task-model alignment: conventional vision models are better suited to pure pixel-artifact cues, whereas VLMs are better suited to pure semantic cues.
  • Figure 3: Comparison between AIGI-Now and existing benchmarks. Top: Existing benchmarks mainly rely on pre-2023 open-source generators. By contrast, real-world usage is dominated by newer commercial, closed-source models, creating a clear gap between existing benchmark data and real-world fakes. Bottom: Our proposed AIGI-Now is constructed from the latest generators (all released after 2024), including six commercial, closed-source models with unknown architectures, such as GPT-4o. This enables rigorous evaluation of detectors’ cross-architecture generalization and brings the benchmark substantially closer to real-world application scenarios.
  • Figure 4: Robustness evaluation on AIGI-Now (pixel). Double JPEG compression applies JPEG compression with the same quality factor twice. Double resizing first downsamples the image and then upsamples it to the original size. The results show that AlignGemini consistently exhibits superior robustness under these post-processing operations.
  • Figure 5: Isolated impact of VLM and expert branches. (a) Evaluates pixel-artifact detection. (b) Evaluates semantic detection.
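
The two post-processing operations described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch using the Pillow library, not the authors' evaluation code: the function names, the default quality factor of 75, the 0.5 downsampling scale, and the bicubic resampling filter are assumptions, since the caption does not specify them.

```python
import io

from PIL import Image


def jpeg_compress(image: Image.Image, quality: int) -> Image.Image:
    """Encode the image as JPEG in memory and decode it again."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    decoded = Image.open(buffer)
    decoded.load()  # force full decoding while the buffer is still available
    return decoded


def double_jpeg(image: Image.Image, quality: int = 75) -> Image.Image:
    """Apply JPEG compression twice with the same quality factor.

    The default quality of 75 is an assumed value for illustration.
    """
    return jpeg_compress(jpeg_compress(image, quality), quality)


def double_resize(image: Image.Image, scale: float = 0.5) -> Image.Image:
    """Downsample by `scale`, then upsample back to the original size.

    The scale and the bicubic filter are assumed values for illustration.
    """
    width, height = image.size
    small_size = (max(1, round(width * scale)), max(1, round(height * scale)))
    small = image.resize(small_size, Image.BICUBIC)
    return small.resize((width, height), Image.BICUBIC)


if __name__ == "__main__":
    # A synthetic placeholder image stands in for a benchmark sample.
    sample = Image.new("RGB", (64, 48), (120, 30, 200))
    print(double_jpeg(sample, quality=50).size)
    print(double_resize(sample, scale=0.5).size)
```

Both operations return an image of the original size, so a detector can be evaluated on the perturbed images without changing its input pipeline.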