OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany

TL;DR

OpenFake introduces a large, politically grounded deepfake benchmark combining 3 million real images with nearly 1 million high-fidelity synthetic images from modern generators, plus OpenFake Arena for crowdsourced adversarial submissions. The framework demonstrates that detectors trained on OpenFake achieve near-perfect in-distribution accuracy and strong generalization to unseen generators, while performing well on a curated in-the-wild social-media test set. The authors further show that human perception struggles with advanced proprietary fakes, underscoring the necessity of robust automatic detectors. By providing an extensible, openly hosted benchmark and a live-adversarial platform, OpenFake offers a practical path toward real-world deepfake detection and ongoing alignment with evolving generative threats.

Abstract

Deepfakes, synthetic media created using advanced AI techniques, pose a growing threat to information integrity, particularly in politically sensitive contexts. This challenge is amplified by the increasing realism of modern generative models, whose outputs our human perception study confirms are often indistinguishable from real images. Yet existing deepfake detection benchmarks rely on outdated generators or narrowly scoped datasets (e.g., single-face imagery), limiting their utility for real-world detection. To address these gaps, we present OpenFake, a large, politically grounded dataset specifically crafted for benchmarking against modern generative models with high realism, and designed to remain extensible through an innovative crowdsourced adversarial platform that continually integrates new hard examples. OpenFake comprises nearly four million total images: three million real images paired with descriptive captions and almost one million synthetic counterparts from state-of-the-art proprietary and open-source models. Detectors trained on OpenFake achieve near-perfect in-distribution performance, strong generalization to unseen generators, and high accuracy on a curated in-the-wild social media test set, significantly outperforming models trained on existing datasets. Overall, we demonstrate that with high-quality and continually updated benchmarks, automatic deepfake detection is both feasible and effective in real-world settings.

Paper Structure

This paper contains 45 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: We begin by scraping politically relevant images from social media (e.g., X, Reddit, Bluesky), filtered by election-related hashtags. Manual investigation of these social media images helps us to design a prompt for filtering politically relevant images. A vision-language model (e.g., Qwen2.5-VL) extracts thematic captions or prompts from real images from LAION. These prompts serve dual purposes: (1) forming a large bank of real image–prompt pairs, and (2) seeding generation across a range of synthetic image models (e.g., SDv3.5, Flux, Ideogram, GPT Image 1).
  • Figure 2: Examples of deepfake images collected from X depicting various types of fabricated scenarios involving Canadian political figures and events.
  • Figure 3: Examples of deepfakes from each model used in the survey with their respective real image.
  • Figure 4: t-SNE visualization of CLIP vision embeddings for 3,500 test images, including both real and synthetic images from a few generative models. Each point corresponds to an individual image, and colours indicate the generative model (or “real” for authentic images).
  • Figure 5: OpenFake Arena interface. Users are presented with a prompt and asked to generate an image that can fool the detector.
  • ...and 7 more figures