UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation

Arka Mukherjee; Shreya Ghosh

TL;DR

UNITE-FND tackles the heavy computational cost of multimodal fake news detection by translating visual content into structured text using six prompting strategies, so that unimodal text classifiers can run on consumer hardware. The paper introduces Uni-Fakeddit-55k, a 55k-sample dataset derived from Fakeddit via the prompting framework, and shows that text-only models can match or surpass heavy multimodal baselines at far lower compute. It also presents five novel evaluation metrics (IPR, SCS, ISS, SIR, MTE), along with CIQS, to quantify how much information is preserved in image-to-text translation. The results show 92.52% binary accuracy on Uni-Fakeddit-55k and substantial reductions in memory and cost (e.g., TinyBERT at 14.5M parameters, $2), indicating practical deployment potential and a scalable alternative for misinformation detection in resource-constrained settings.

Abstract

Multimodal fake news detection typically demands complex architectures and substantial computational resources, posing deployment challenges in real-world settings. We introduce UNITE-FND, a novel framework that reframes multimodal fake news detection as a unimodal text classification task. We propose six specialized prompting strategies with Gemini 1.5 Pro that convert visual content into structured textual descriptions, enabling efficient text-only models to preserve critical visual information. To benchmark our approach, we introduce Uni-Fakeddit-55k, a curated family of datasets with 55,000 samples each, all processed through our multimodal-to-unimodal translation framework. Experimental results demonstrate that UNITE-FND achieves 92.52% accuracy in binary classification, surpassing prior multimodal models while reducing computational costs by over 10x (TinyBERT variant: 14.5M parameters vs. 250M+ in SOTA models). Additionally, we propose a comprehensive suite of five novel metrics to evaluate image-to-text conversion quality and verify that information is preserved. Our results demonstrate that structured text-based representations can replace direct multimodal processing with minimal loss of accuracy, making UNITE-FND a practical and scalable alternative for resource-constrained environments.
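To make the reframing concrete, the following is a minimal sketch of how a post's text and the textual descriptions of its image might be merged into a single input for a text-only classifier, together with the parameter-count arithmetic behind the "over 10x" claim. It is an illustration under stated assumptions, not the authors' implementation: the function names, the bracketed tag format, and the strategy labels (`objects`, `scene`) are hypothetical, and the image descriptions are assumed to have been produced beforehand by the vision-language model.

```python
from typing import Dict


def build_unimodal_input(title: str, scene_descriptions: Dict[str, str]) -> str:
    """Merge a post title with text descriptions of its image into one string.

    `scene_descriptions` maps a (hypothetical) prompting-strategy label to the
    text generated for the image under that strategy. Empty outputs are skipped.
    """
    parts = [f"[TITLE] {title.strip()}"]
    # Sort the labels so the same post always yields the same input string.
    for strategy in sorted(scene_descriptions):
        text = scene_descriptions[strategy].strip()
        if text:
            parts.append(f"[{strategy.upper()}] {text}")
    return "\n".join(parts)


def parameter_reduction(baseline_params: float, compact_params: float) -> float:
    """Return how many times smaller the compact model is than the baseline."""
    return baseline_params / compact_params


if __name__ == "__main__":
    # Illustrative example data, not taken from the dataset.
    example = build_unimodal_input(
        "Shark swims on flooded highway",
        {"scene": "a flooded road at dusk", "objects": "a shark, two cars"},
    )
    print(example)
    # 250M is the lower bound quoted for SOTA models; 14.5M is the TinyBERT variant.
    print(round(parameter_reduction(250e6, 14.5e6), 1))
```

The resulting string can be passed to any standard text classifier. With the figures quoted in the abstract, the parameter ratio is roughly 17x, which is consistent with the stated reduction of over 10x.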
