Perception, Understanding and Reasoning, A Multimodal Benchmark for Video Fake News Detection
Cui Yakun, Fushuo Huo, Weijie Shi, Juntao Dai, Hang Du, Zhenghao Zhu, Sirui Han, Yike Guo
TL;DR
This paper tackles video fake news detection by introducing MVFNDB, a process-oriented benchmark designed to evaluate perception, understanding, and reasoning in multimodal large language models (MLLMs). It presents MVFND-CoT, a framework that jointly reasons over creator-added content and original shooting footage, and analyzes how video processing choices and feature alignment affect detection performance. An empirical analysis of the FakeSV dataset reveals discriminative cues in color, text layout, and footage type, which motivate the targeted tasks; the benchmark itself provides data curation, task definitions, and diverse evaluation metrics. Experiments across six MLLMs show that model scale and processing strategy interact with task type: Gemini-2.5-Flash achieves the leading performance, and chain-of-thought prompting yields notable gains, especially for larger models, underscoring the benchmark's value for guiding the development of multimodal video fake news detection.
Abstract
The advent of multimodal large language models (MLLMs) has greatly advanced research on video fake news detection (VFND). Traditional video-based fake news detection benchmarks typically focus on the accuracy of the final decision and fail to provide fine-grained assessments of the detection process as a whole, leaving that process a black box. We therefore introduce MVFNDB (Multi-modal Video Fake News Detection Benchmark), whose task definitions are grounded in an empirical analysis. The benchmark comprises 10 tasks and is meticulously crafted to probe MLLMs' perception, understanding, and reasoning capacities during detection, featuring 9730 human-annotated video-related questions based on a carefully constructed taxonomy of VFND abilities. To validate the impact of combining multiple features on the final results, we design a novel framework named MVFND-CoT, which incorporates reasoning over both creator-added content and original shooting footage. Building upon the benchmark, we conduct an in-depth analysis of the deeper factors influencing accuracy, including video processing strategies and the alignment between video features and model capabilities. We believe this benchmark will lay a solid foundation for future evaluation and advancement of MLLMs in the domain of video fake news detection.
