From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation

Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem

TL;DR

This work tackles the growing problem of health misinformation amplified by generative AI by introducing MM-Health, a large-scale multimodal dataset containing 34,746 health news items with both human- and AI-generated text and images. It combines data from existing health misinformation sources with AI-generated counterparts across five text models and five image models, followed by a consensus-based post-processing pipeline and human evaluation to ensure data quality. The authors benchmark multiple Vision-Language Models on three tasks—information reliability, originality, and fine-grained AI detection—revealing that current models struggle to reliably verify content and identify provenance, especially under mixed human/AI content. By releasing MM-Health, the paper sets a challenging benchmark to spur development of more robust multimodal detectors and outlines directions for improving generalization and expanding to additional modalities in health misinformation detection.

Abstract

Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI-generated content, and accessibility of raw content. To address these issues, we present MM-Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM-Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset on three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.

Paper Structure

This paper contains 27 sections, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: The data collection process for MM-Health includes: 1) utilising multiple existing open-source health misinformation datasets as data sources, 2) validating the available data samples and collecting human-generated multimodal data from the provided URLs, and 3) implementing generative AI models to collect AI-generated replicated multimodal data. To ensure data quality, both human- and AI-generated content are evaluated by five human evaluators proficient in English.
  • Figure 2: KDE distribution of the semantic similarity between the human articles and articles from five LLMs.
  • Figure 3: KDE distribution of image similarity between real and generated images across five image models.
  • Figure 4: Heatmap representation of the Task 3 fine-grained AI detection analysis. Each heatmap illustrates F1 scores from various VLLMs across twenty-five different combinations of AI-generated content. Darker colours represent higher F1 scores.
  • Figure 5: Sample of removed images after web scraping. These images are irrelevant to the health topic, including blurry, logo-based, fuzzy, or meaningless images.
  • ...and 7 more figures
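
The twenty-five combinations in the Figure 4 caption correspond to pairing each of the five text models with each of the five image models. The following is a minimal sketch of how such a per-combination F1 grid could be computed. It is not the authors' evaluation code: the model names are hypothetical placeholders, and the use of binary F1 with label 1 as the positive class is an assumption, since this page does not say how Task 3 labels are defined or aggregated.

```python
from itertools import product

# Hypothetical placeholder names: the source only states that there are
# five text models and five image models, not what to call them here.
TEXT_MODELS = [f"text_model_{i}" for i in range(1, 6)]
IMAGE_MODELS = [f"image_model_{i}" for i in range(1, 6)]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_grid(records):
    """Group predictions by (text model, image model) and score each cell.

    records: iterable of (text_model, image_model, true_label, predicted_label).
    Returns a dict with one F1 value for each of the 5 x 5 = 25 combinations.
    """
    grouped = {pair: ([], []) for pair in product(TEXT_MODELS, IMAGE_MODELS)}
    for text_model, image_model, y_true, y_pred in records:
        truths, preds = grouped[(text_model, image_model)]
        truths.append(y_true)
        preds.append(y_pred)
    return {pair: f1_score(t, p) for pair, (t, p) in grouped.items()}
```

Each value in the returned dictionary would fill one cell of a 5 x 5 heatmap for a single detector; repeating the computation per detector would give one heatmap per model, as the caption describes.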