FIDAVL: Fake Image Detection and Attribution using Vision-Language Model

Mamadou Keita; Wassim Hamidouche; Hessen Bougueffa Eutamene; Abdelmalik Taleb-Ahmed; Abdenour Hadid

TL;DR

The paper addresses the rising challenge of distinguishing real from AI-generated images and tracing their origin to specific models. It introduces FIDAVL, a single-step multitask framework that uses a vision-language model and soft prompt tuning to perform detection and attribution via a VQA-style query, formalized as $\hat{y}=\mathcal{M}_{\theta}(I,q)$. On a large-scale dataset containing real images and synthetic images from GANs and diffusion models, FIDAVL achieves strong results (average detection accuracy around 95% and ROUGE-L around 96% for attribution) and demonstrates robust generalization to unseen generators. The work highlights the potential of vision-language models for visual forensics and provides public code to foster replication and future extensions.
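
The formalization $\hat{y}=\mathcal{M}_{\theta}(I,q)$ can be unpacked a little. What follows is the generic form of soft prompt tuning, offered as a reading aid and not necessarily the paper's exact equations. The symbols $P$ (a small matrix of learnable prompt vectors), $E$ (the text embedding layer), and $y=(y_1,\dots,y_T)$ (the reference answer tokens) are assumptions introduced here for illustration, as is the assumption that the pretrained weights $\theta$ stay frozen while only $P$ is updated.

$$
\hat{y}=\mathcal{M}_{\theta}\big(I,\,[P;\,E(q)]\big),
\qquad
P^{*}=\arg\min_{P}\; -\sum_{t=1}^{T}\log p_{\theta}\big(y_{t}\mid y_{<t},\,I,\,[P;\,E(q)]\big)
$$

Under this reading, a single generated answer $\hat{y}$ carries both the real-versus-synthetic decision and the source-model attribution, which is what makes the approach single-step.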

Abstract

We introduce FIDAVL: Fake Image Detection and Attribution using a Vision-Language Model. FIDAVL is a novel and efficient multitask approach inspired by the synergies between vision and language processing. Leveraging the benefits of zero-shot learning, FIDAVL exploits the complementarity between vision and language along with a soft prompt-tuning strategy to detect fake images and accurately attribute them to their originating source models. We conducted extensive experiments on a comprehensive dataset comprising synthetic images generated by various state-of-the-art models. Our results demonstrate that FIDAVL achieves an encouraging average detection accuracy of 95.42% and F1-score of 95.47% while also obtaining noteworthy performance metrics, with an average F1-score of 92.64% and ROUGE-L score of 96.50% for attributing synthetic images to their respective source generation models. The source code of this work will be publicly released at https://github.com/Mamadou-Keita/FIDAVL.
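
Because attribution is reported with ROUGE-L, a text-overlap metric, it may help to see how such a score is computed from a generated answer and a reference answer. The snippet below is a minimal sketch, assuming lowercased whitespace tokenization and the balanced F-measure over the longest common subsequence (LCS); the paper's exact ROUGE-L configuration is not stated here, and the example answer strings are hypothetical, not the paper's answer template.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Balanced ROUGE-L F-measure (assumption: beta = 1, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers: correct detection wording, wrong source model.
reference = "synthetic image generated by stable diffusion"
candidate = "synthetic image generated by glide"
score = rouge_l_f1(reference, candidate)
```

In this hypothetical pair, the shared prefix keeps the score well above zero even though the named generator is wrong, which is one reason ROUGE-L for attribution is reported alongside a class-level F1-score.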

Paper Structure (15 sections, 2 equations, 4 figures, 3 tables)

Introduction
Background and Related Work
Generative Models
Synthetic Image Detection and Attribution
Vision Language Models
Prompt Tuning for Vision Language Models
Proposed Synthetic Image Detection and Localization
Problem Formulation
Soft Prompt Tuning
Experimental Results
Synthetic Image Detection
Comparative analysis.
Generalization to unseen generative models.
Synthetic Image Attribution
Conclusion and Future Work

Figures (4)

Figure 1: Architecture of the proposed synthetic image detection and localization.
Figure 2: Confusion matrices per testing subset on synthetic image detection task.
Figure 3: Confusion matrices indicate which synthetic images detected as synthetic are correctly classified according to their generating source model.
Figure 4: Confusion Matrix for Attribution Task: Synthetic data correctly classified as synthetic but attributed to a different source from the generating source.
