AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild

Nicholas Dufour, Arkanath Pathak, Pouya Samangouei, Nikki Hariri, Shashi Deshetti, Andrew Dudfield, Christopher Guess, Pablo Hernández Escayola, Bobby Tran, Mevan Babakar, Christoph Bregler

TL;DR

AMMeBa addresses the lack of quantitative, longitudinal data on media-based misinformation through a two-year, multi-stage human annotation of claims drawn from fact checks carrying the ClaimReview markup. It introduces a detailed image-centric typology (media-based vs. non-media claims, image types, and manipulation types) and produces a publicly available dataset that catalogs the prevalence and modalities of misinformation in the wild, including the recent rise of AI-generated content. The findings show media involvement in roughly 80% of misinformation claims, a shift toward video, and a dominance of context-based manipulations; AI-generated imagery has become increasingly prominent but had not overtaken other manipulation forms by the end of data collection in November 2023. The dataset offers a realistic benchmark for evaluating mitigation methods and for tracking the evolution of media-based misinformation across modalities and external events.

Abstract

The prevalence and harms of online misinformation are a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely accessible methods for synthesizing realistic audio, images, video and human-like text, has amplified these concerns. Despite intense public interest and significant press coverage, quantitative information on the prevalence and modality of media-based misinformation remains scarce. Here, we present the results of a two-year study using human raters to annotate online media-based misinformation, mostly focusing on images, based on claims assessed in a large sample of publicly accessible fact checks with the ClaimReview markup. We present an image typology, designed to capture aspects of the image and manipulation relevant to the image's role in the misinformation claim. We visualize the distribution of these types over time. We show the rise of generative AI-based content in misinformation claims, and that its commonality is a relatively recent phenomenon, occurring significantly after heavy press coverage. We also show that "simple" methods, particularly context manipulations, dominated historically and continued to hold a majority as of the end of data collection in November 2023. The dataset, Annotated Misinformation, Media-Based (AMMeBa), is publicly available, and we hope that these data will serve both as a means of evaluating mitigation methods in a realistic setting and as a first-of-its-kind census of the types and modalities of online misinformation.

Paper Structure (54 sections, 31 figures, 2 tables)

Figures (31)

  • Figure 1: Examples of media occurring alongside fact-checked misinformation claims. In this report, we introduce a typology to capture the enormous variation in media-based (particularly image-based) misinformation seen in-the-wild and categorize a very large sample of misinformation claims with it.
  • Figure 2: Media manipulations have a long history. Top Left: A comparison of an image of Joseph Stalin, originally taken in 1937, in which an associate, Nikolai Yezhov, is present, with a later version in which he has been manually airbrushed out of the official image following his fall from favor. Top Middle: A still from the 1967 "Patterson-Gimlin film," alleged to show the cryptid Bigfoot, taken along the Klamath River in California. While its authenticity is disputed to this day, experts regard the footage as a hoax. Top Right: A 1994 Time Magazine cover featuring a mugshot of O.J. Simpson that had been artificially darkened to appear more sinister; compare the same photograph on the cover of Newsweek. Bottom Left: The "Tourist Guy" image, originally taken in 1997. The creator of the image confessed to manipulating it to add the airplane "as a joke." Bottom Middle: A 2008 image of Kim Jong Il that a BBC analysis concluded was fake. Bottom Right: A 2024 AI-generated image, dubbed "Shrimp Jesus," that went viral on social media.
  • Figure 3: Count of fact checks with annotations by publication date. Dates are approximate: they rely on the date reported by the fact checker or the date when the fact check was first encountered, preferring the self-reported date where available. The dashed green line indicates the start of data collection, November 11, 2021.
  • Figure 4: Examples of "basic" images. Basic images are photograph-like, although they may be illustrations. They do not contain significant overlaid text, graphical or GUI elements. They are "coherent," depicting a unified scene, and do not comprise multiple sub-images in a collage or mosaic. Note these images have been cropped to square aspect ratio and resized.
  • Figure 5: Examples of "complex" images. Complex images include all images that are not basic images (see Figure 4 and the typology section on basic images), because they include graphical components, are compositions of basic images, or are screenshots. Note these images have been cropped to square aspect ratio and resized.
  • ...and 26 more figures
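
The basic/complex distinction described in the captions of Figures 4 and 5 can be read as a simple decision rule: an image is "basic" only if it has no significant overlaid text, no graphical or GUI elements, is not a screenshot, and is not a collage of sub-images; everything else is "complex." The sketch below is only an illustration of that rule. The class and field names are hypothetical and are not the AMMeBa dataset's actual column names or the raters' actual questionnaire.

```python
from dataclasses import dataclass
from enum import Enum


class ImageType(Enum):
    BASIC = "basic"
    COMPLEX = "complex"


@dataclass(frozen=True)
class ImageTraits:
    # Hypothetical rater judgments; these names are assumptions made for
    # illustration and are NOT AMMeBa column names.
    has_significant_overlaid_text: bool = False
    has_graphical_or_gui_elements: bool = False
    is_screenshot: bool = False
    is_collage_of_subimages: bool = False


def classify(traits: ImageTraits) -> ImageType:
    """Return BASIC only when no disqualifying trait is present."""
    disqualified = (
        traits.has_significant_overlaid_text
        or traits.has_graphical_or_gui_elements
        or traits.is_screenshot
        or traits.is_collage_of_subimages
    )
    return ImageType.COMPLEX if disqualified else ImageType.BASIC
```

For example, `classify(ImageTraits(is_screenshot=True))` yields `ImageType.COMPLEX`, while an image with none of the traits set is classified as `ImageType.BASIC`. Defining "complex" as the complement of "basic" mirrors the wording of the Figure 5 caption.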