FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid

TL;DR

The paper addresses the growing challenge of distinguishing real from AI-generated images and tracing synthetic images back to the specific models that produced them. It introduces FIDAVL, a single-step multitask framework that uses a vision-language model with soft prompt tuning to perform detection and attribution through a VQA-style query, formalized as $\hat{y}=\mathcal{M}_{\theta}(I,q)$, where $I$ is the input image, $q$ is the query, and $\hat{y}$ is the generated answer. On a large-scale dataset containing real images and synthetic images from GANs and diffusion models, FIDAVL achieves an average detection accuracy of around 95% and a ROUGE-L score of around 96% for attribution, and it generalizes robustly to unseen generators. The work highlights the potential of vision-language models for visual forensics and provides public code to support replication and future extensions.
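As a rough illustration of the single-step formulation $\hat{y}=\mathcal{M}_{\theta}(I,q)$, the sketch below shows how one generated text answer could yield both a detection decision and an attribution label. Everything here is an assumption made for illustration: the generator names, the query wording, the answer format, and the `stub_model` stand-in are hypothetical and are not taken from the paper or its code.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical label set; the source only says the data covers GANs and diffusion models.
GENERATORS = ("generator_a", "generator_b")

# Hypothetical query text; the paper's actual prompt is not given in this summary.
QUERY = "Is this image real or fake? If fake, which model generated it?"


def decode_answer(answer: str, generators: Sequence[str]) -> Tuple[bool, Optional[str]]:
    """Map a free-text answer to (is_fake, source_model).

    Assumes the answer either names a known generator or contains the word "real".
    """
    text = answer.strip().lower()
    for name in generators:
        if name.lower() in text:
            return True, name  # naming a generator implies the image is synthetic
    if "real" in text:
        return False, None
    raise ValueError(f"Unrecognized answer: {answer!r}")


def predict(model: Callable[[object, str], str], image: object, query: str = QUERY):
    """One forward call y_hat = M_theta(I, q), then decode both tasks from y_hat."""
    y_hat = model(image, query)
    return decode_answer(y_hat, GENERATORS)


def stub_model(image: object, query: str) -> str:
    """Stand-in for a vision-language model; it just returns a fixed answer."""
    return "This image was generated by generator_b"


is_fake, source = predict(stub_model, image=None)
print(is_fake, source)
```

The point of the sketch is only that detection and attribution share a single query and a single answer, rather than running as two separate classifiers.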

Abstract

We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language, along with a soft prompt-tuning strategy, to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and an F1-score of 95.47%, while also obtaining noteworthy attribution performance, with an average F1-score of 92.64% and a ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at https://github.com/Mamadou-Keita/FIDAVL.
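Because attribution is scored with ROUGE-L, it may help to see how that metric compares a generated answer with a reference answer. The following is a minimal sketch of the standard sentence-level ROUGE-L F-measure based on the longest common subsequence (LCS) of whitespace tokens. It is not the authors' evaluation code; the tokenization, the `beta` default, and the example strings are assumptions chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure between a candidate and a reference string.

    Tokens are lowercased and split on whitespace (an assumption of this sketch).
    With beta = 1 the score is the harmonic mean of LCS precision and recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical answers: the candidate names the wrong generator but shares two tokens.
print(rouge_l("generated by generator_a", "generated by some generator_b"))
print(rouge_l("generated by generator_a", "generated by generator_a"))
```

Note that a wrong attribution can still receive partial credit under ROUGE-L when the answer template overlaps with the reference, which is one reason the abstract reports an F1-score alongside it.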

Paper Structure (15 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Architecture of the proposed synthetic image detection and localization.
  • Figure 2: Confusion matrices per testing subset on synthetic image detection task.
  • Figure 3: Confusion matrices indicate which synthetic images detected as synthetic are correctly classified according to their generating source model.
  • Figure 4: Confusion Matrix for Attribution Task: Synthetic data correctly classified as synthetic but attributed to a different source from the generating source.