UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation

Arka Mukherjee, Shreya Ghosh

TL;DR

UNITE-FND tackles the heavy computational cost of multimodal fake news detection by transforming visual content into structured text with six prompting strategies, enabling unimodal text classifiers to operate on consumer hardware. The paper introduces Uni-Fakeddit-55k, a 55k-sample dataset derived from Fakeddit via the prompting framework, and validates that text-only models can match or surpass heavy multimodal baselines with far lower compute. It also presents five novel evaluation metrics (IPR, SCS, ISS, SIR, MTE) and CIQS to quantify information preservation in image-to-text translations. The results show 92.52% binary accuracy on Uni-Fakeddit-55k and substantial reductions in memory and cost (e.g., TinyBERT at 14.5M params, $2), indicating practical deployment potential and offering a scalable alternative for misinformation detection in resource-constrained settings.

Abstract

Multimodal fake news detection typically demands complex architectures and substantial computational resources, posing deployment challenges in real-world settings. We introduce UNITE-FND, a novel framework that reframes multimodal fake news detection as a unimodal text classification task. We propose six specialized prompting strategies with Gemini 1.5 Pro that convert visual content into structured textual descriptions, so that efficient text-only models retain access to critical visual information. To benchmark our approach, we introduce Uni-Fakeddit-55k, a curated dataset family in which each variant contains 55,000 samples processed through our multimodal-to-unimodal translation framework. Experimental results demonstrate that UNITE-FND achieves 92.52% accuracy in binary classification, surpassing prior multimodal models while reducing computational costs by over 10x (TinyBERT variant: 14.5M parameters vs. 250M+ in SOTA models). Additionally, we propose a suite of five novel metrics to evaluate how well image-to-text conversion preserves information. Our results demonstrate that structured text-based representations can replace direct multimodal processing with minimal loss of accuracy, making UNITE-FND a practical and scalable alternative for resource-constrained environments.
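
As a rough illustration of the reframing described in the abstract, the sketch below merges a post's title with a generated scene description into one string that a text-only classifier could consume. The `TranslatedPost` schema, the `build_unimodal_input` helper, the ` [SEP] ` separator, and the sample values are all assumptions made for illustration; the paper's actual input format may differ.

```python
from dataclasses import dataclass


@dataclass
class TranslatedPost:
    """One post after image-to-text translation (hypothetical schema)."""
    title: str
    scene_description: str  # text a vision-language model produced for the image


def build_unimodal_input(post: TranslatedPost, sep: str = " [SEP] ") -> str:
    """Join title and scene description into a single classifier input string."""
    # Collapse runs of whitespace so the classifier sees clean text.
    title = " ".join(post.title.split())
    description = " ".join(post.scene_description.split())
    if not description:
        # Fall back to the title alone when no description is available.
        return title
    return f"{title}{sep}{description}"


# Made-up example values, not taken from the dataset.
example = TranslatedPost(
    title="Eiffel Tower  spotted in futuristic city",
    scene_description="skyscrapers, Eiffel Tower, flying vehicles",
)
text = build_unimodal_input(example)
print(text)
```

The resulting string can be passed to any tokenizer and text classifier, which is what lets a small model such as TinyBERT stand in for a multimodal architecture.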

Paper Structure

This paper contains 123 sections, 10 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview of the UNITE-FND framework. Our approach transforms multimodal fake news detection into a unimodal task through specialized prompting strategies and efficient text classification.
  • Figure 2: Dataset creation pipeline for Uni-Fakeddit-55k. Each entry from the Fakeddit dataset is processed using six specialized prompting strategies with Gemini 1.5 Pro for image-to-text conversion. The pipeline consists of initial preprocessing, parallel prompting pathways, and structured dataset organization, generating six complementary text-based representations.
  • Figure 3: Distribution comparison between our Uni-Fakeddit-55k dataset (left) and the original Fakeddit-700k dataset (right). The pie charts demonstrate that our sampling strategy preserves the relative proportions of different content categories while creating a more manageable dataset size. Both datasets maintain similar class distributions across six categories: True, Satire/Parody, Misleading Content, Manipulated Content, False Content, and Imposter Content, with variations of less than 0.5% in relative proportions.
  • Figure 4: Illustration of the List of Objects prompting approach. The system takes two inputs: (1) a carefully engineered text prompt that requests a comma-separated list of distinct, identifiable objects, and (2) the target image (shown: futuristic cityscape with Eiffel Tower). Gemini 1.5 Pro processes these inputs to generate a structured CSV output containing all major visible objects.
  • ...and 9 more figures
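
The Figure 3 caption states that the 55k subset keeps the class proportions of the original Fakeddit data. The page does not describe how the sampling was done, so the following is only a minimal sketch of one common way to get that property: proportional (stratified) allocation per class. The `stratified_sample` function and the toy 60/30/10 label split are hypothetical and do not reflect the real Fakeddit distribution.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, target_size, seed=0):
    """Return indices sampled so each class keeps roughly its original share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    chosen = []
    for label, indices in by_class.items():
        # Proportional allocation, rounded to the nearest whole sample.
        quota = round(target_size * len(indices) / total)
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return chosen


# Toy labels with a made-up 60/30/10 split (not the real Fakeddit proportions).
labels = ["true"] * 600 + ["satire"] * 300 + ["misleading"] * 100
subset = stratified_sample(labels, target_size=100)
counts = Counter(labels[i] for i in subset)
print(counts)
```

Because each quota is rounded independently, the subset size can differ from `target_size` by a few samples when the class shares do not divide evenly.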