LookupForensics: A Large-Scale Multi-Task Dataset for Multi-Phase Image-Based Fact Verification

Shuhan Cui, Huy H. Nguyen, Trung-Nghia Le, Chun-Shien Lu, Isao Echizen

TL;DR

This work formalizes image-based automated fact verification, which combines forgery identification with retrieval of the original image. It introduces a two-phase open framework and a large-scale, multi-task dataset built on Google Open Images, with content-preserving and content-aware manipulations spanning copy-move, splicing, removal, and colorization. Evaluations indicate that jointly using detection and retrieval improves fact verification, and that the dataset is more realistic and more challenging than prior image copy detection (ICD) and Image Similarity Challenge (ISC) benchmarks. By backing authenticity judgments with retrieved originals, the approach has implications for scalability, interpretability, and robustness in real-world misinformation mitigation.

Abstract

Amid the proliferation of forged images, notably the tsunami of deepfake content, extensive research has been conducted on using artificial intelligence (AI) to identify forged content in the face of continuing advancements in counterfeiting technologies. We have investigated the use of AI to provide the original authentic image after deepfake detection, which we believe is a reliable and persuasive solution. We call this "image-based automated fact verification," a name that originated from a text-based fact-checking system used by journalists. We have developed a two-phase open framework that integrates detection and retrieval components. Additionally, inspired by a dataset proposed by Meta Fundamental AI Research, we further constructed a large-scale dataset that is specifically designed for this task. This dataset simulates real-world conditions and includes both content-preserving and content-aware manipulations that present a range of difficulty levels and have potential for ongoing research. This multi-task dataset is fully annotated, enabling it to be utilized for sub-tasks within the forgery identification and fact retrieval domains. This paper makes two main contributions: (1) We introduce a new task, "image-based automated fact verification," and present a novel two-phase open framework combining "forgery identification" and "fact retrieval." (2) We present a large-scale dataset tailored for this new task that features various hand-crafted image edits and machine learning-driven manipulations, with extensive annotations suitable for various sub-tasks. Extensive experimental results validate its practicality for fact verification research and clarify its difficulty levels for various sub-tasks.
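
The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Detection`, `Verdict`, `verify`, `toy_detect`, and `toy_retrieve` are hypothetical, and the stubs stand in for whatever detection and retrieval toolboxes are plugged in. The sketch only shows how a forgery-identification phase can feed a fact-retrieval phase through interchangeable callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Detection:
    """Output of phase 1 (forgery identification). Hypothetical structure."""
    is_forged: bool
    regions: List[Tuple[int, int, int, int]] = field(default_factory=list)  # assumed (x, y, w, h) boxes


@dataclass
class Verdict:
    """Combined result: the forgery decision plus candidate original image IDs."""
    is_forged: bool
    originals: List[str]


# Both phases are plain callables so that any toolbox can be swapped in.
Detector = Callable[[object], Detection]
Retriever = Callable[[object, Detection], List[str]]


def verify(image: object, detect: Detector, retrieve: Retriever) -> Verdict:
    """Run phase 1 (forgery identification), then phase 2 (fact retrieval)."""
    detection = detect(image)
    originals = retrieve(image, detection)
    return Verdict(detection.is_forged, originals)


# Toy stand-ins (assumptions for illustration only): an "image" is just a string.
def toy_detect(image: str) -> Detection:
    forged = "forged" in image
    return Detection(is_forged=forged, regions=[(0, 0, 8, 8)] if forged else [])


def toy_retrieve(image: str, detection: Detection) -> List[str]:
    return ["orig_001"] if detection.is_forged else []


verdict = verify("forged_sample", toy_detect, toy_retrieve)
```

Whether retrieval should be skipped for images judged authentic is a design choice this sketch leaves to the retriever; here the toy retriever simply returns no candidates in that case.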

Paper Structure

This paper contains 31 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Upper half illustrates the pipeline of our proposed open framework for image-based fact verification, which comprises two phases: forgery identification and fact retrieval. The modules therein can be replaced with almost any open-source toolbox. Lower half shows examples of two specific forgery types.
  • Figure 2: Distribution of the proportion of forgery region, forgery type, and forgery object class in our dataset.
  • Figure 3: Fact retrieval is divided into two branches: global retrieval and local retrieval. For an image without any overlays, global retrieval alone can find the original images. For images with one or more overlays, global retrieval is used to search for the entire image, and local retrieval is used to search for the detected forgery segments.
  • Figure 4: Examples of similar image pairs that were removed during preprocessing: (a) and (b) show visually similar image pairs, (c) shows images captured from different camera angles, and (d) shows images taken at the same location but at different times.
  • Figure 5: Examples of four types of forgery images in our dataset. Original image, forged image, and corresponding mask are shown for the first three types. Original color image, grayscale image, and forged image are shown for the last type (colorization).
  • ...and 5 more figures
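
The global/local branching described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's retrieval system: images and forgery segments are represented by small hand-written embedding vectors, the reference `index` is a plain dictionary, and `cosine`, `nearest`, and `retrieve` are hypothetical helper names. A real system would presumably use learned descriptors and an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: Vector, index: Dict[str, Vector]) -> str:
    """Return the ID of the reference image most similar to the query."""
    return max(index, key=lambda image_id: cosine(query, index[image_id]))


def retrieve(global_vec: Vector,
             segment_vecs: List[Vector],
             index: Dict[str, Vector]) -> List[str]:
    """Global branch on the whole image, local branch on each detected segment."""
    results = [nearest(global_vec, index)]  # global retrieval always runs
    for seg in segment_vecs:                # local retrieval, one query per segment
        hit = nearest(seg, index)
        if hit not in results:
            results.append(hit)
    return results


# Hypothetical reference index with three original images.
index = {
    "beach": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

# No overlays detected: global retrieval alone is used.
no_overlay = retrieve([0.9, 0.1, 0.0], [], index)

# One overlay detected: the segment is searched separately.
with_overlay = retrieve([0.9, 0.1, 0.0], [[0.1, 0.95, 0.0]], index)
```

With no detected segments, only the global query runs; with one segment, the local branch adds the source of the overlaid content to the result list.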