UNITE-FND: Reframing Multimodal Fake News Detection through Unimodal Scene Translation
Arka Mukherjee, Shreya Ghosh
TL;DR
UNITE-FND tackles the heavy computational cost of multimodal fake news detection by transforming visual content into structured text with six prompting strategies, enabling unimodal text classifiers to operate on consumer hardware. The paper introduces Uni-Fakeddit-55k, a 55k-sample dataset derived from Fakeddit via the prompting framework, and validates that text-only models can match or surpass heavy multimodal baselines with far lower compute. It also presents five novel evaluation metrics (IPR, SCS, ISS, SIR, MTE) and CIQS to quantify information preservation in image-to-text translations. The results show 92.52% binary accuracy on Uni-Fakeddit-55k and substantial reductions in memory and cost (e.g., TinyBERT at 14.5M params, $2), indicating practical deployment potential and offering a scalable alternative for misinformation detection in resource-constrained settings.
Abstract
Multimodal fake news detection typically demands complex architectures and substantial computational resources, posing deployment challenges in real-world settings. We introduce UNITE-FND, a novel framework that reframes multimodal fake news detection as a unimodal text classification task. Using Gemini 1.5 Pro, we propose six specialized prompting strategies that convert visual content into structured textual descriptions, enabling efficient text-only models to retain critical visual information. To benchmark our approach, we introduce Uni-Fakeddit-55k, a curated family of datasets with 55,000 samples each, all processed through our multimodal-to-unimodal translation framework. Experimental results demonstrate that UNITE-FND achieves 92.52% accuracy in binary classification, surpassing prior multimodal models while reducing computational costs by over 10x (TinyBERT variant: 14.5M parameters vs. 250M+ in SOTA models). Additionally, we propose a comprehensive suite of five novel metrics to evaluate image-to-text conversion quality, ensuring optimal information preservation. Our results show that structured text-based representations can replace direct multimodal processing with minimal loss of accuracy, making UNITE-FND a practical and scalable alternative for resource-constrained environments.
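The translation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Post` type, the `to_unimodal_text` and `stub_describer` functions, and the `[TITLE]`/`[IMAGE]` markers are hypothetical names chosen for illustration. The image describer is a hard-coded stub standing in for a vision-language model call, since the six prompting strategies themselves are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Post:
    """A news post with a headline and an attached image (hypothetical type)."""
    title: str
    image_path: str


def to_unimodal_text(post: Post, describe_image: Callable[[str], str]) -> str:
    """Merge the post title with a textual description of its image.

    The [TITLE]/[IMAGE] markers are an assumed convention, not the paper's format.
    """
    description = describe_image(post.image_path).strip()
    return f"[TITLE] {post.title.strip()} [IMAGE] {description}"


def stub_describer(image_path: str) -> str:
    # Placeholder for a vision-language model call; returns a fixed description
    # so the sketch runs without network access or model weights.
    return "A crowd standing in a flooded street."


post = Post(title="Shark swims down highway after storm", image_path="example.jpg")
text = to_unimodal_text(post, stub_describer)
print(text)
```

The resulting string is what a text-only classifier would receive in place of the original image and title pair, so the downstream model needs no visual encoder.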
