Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

Chong Tian; Yu Wang; Chenxu Yang; Junyi Guan; Zheng Lin; Yuhan Liu; Xiuying Chen; Qirong Ho

Abstract

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation, in which each modality appears plausible on its own yet the cross-modal relationships are subtly inconsistent, for example mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC$^3$ (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes consistency signals across the three modalities (text, visual, audio) at multiple granularities. MAGIC$^3$ combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC$^3$ consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. The two-stage system matches VLM-level accuracy while achieving 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.
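To make the notion of pairwise and global consistency concrete, here is a minimal sketch. It assumes each modality has already been pooled into a single embedding in a shared space, and it uses cosine similarity rescaled to [0, 1], with the global score taken as the plain mean of the three pairwise scores. These choices, and the names `cosine` and `consistency_scores`, are illustrative assumptions; the abstract does not specify how the learned consistency model in MAGIC$^3$ computes its scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(text, visual, audio):
    """Hypothetical pairwise and global consistency scores in [0, 1].

    Assumption: embeddings live in a shared space, so cosine similarity
    is meaningful. The paper's actual scoring is learned, not fixed.
    """
    pairwise = {
        "text_visual": cosine(text, visual),
        "text_audio": cosine(text, audio),
        "visual_audio": cosine(visual, audio),
    }
    # Rescale cosine from [-1, 1] to [0, 1] so scores read as consistency levels.
    scores = {name: (value + 1.0) / 2.0 for name, value in pairwise.items()}
    # Assumed aggregation: unweighted mean of the three pairwise scores.
    global_score = sum(scores.values()) / len(scores)
    scores["global"] = global_score
    return scores


if __name__ == "__main__":
    # Toy embeddings: text agrees with visuals but is orthogonal to audio.
    print(consistency_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

Under this toy setup, a video whose text aligns with its visuals but not its audio yields a high text-visual score and a middling text-audio score, which is the pattern the abstract reports for real videos.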

Paper Structure (47 sections, 32 equations, 7 figures, 11 tables)

Introduction
A consistency-centric view.
Key findings.
Our method.
Contributions.
Related Work
Multimodal fake news on image--text pairs.
Short news video detection.
Methodology
Problem Formulation
Feature Extraction
Consistency Computation
Cross-Modal Fusion
Training and Prediction
Experiments
...and 32 more sections

Figures (7)

Figure 1: Illustration of cross-modal consistency patterns. In real news short videos, text, visuals, and audio are contextually aligned (Consistent). In fake news, a "semantic gap" often exists between the sensational claims (text/audio) and the actual visual content. MAGIC$^3$ acts as a consistency lens to quantify these multimodal relationships for fake news detection.
Figure 2: MAGIC$^3$ Overview. Frozen encoders provide text, visual, audio, and rewrite features. The Cross-Modal Consistency Gate outputs pairwise and global consistency scores; Consistency Field Estimator converts cross-modal attention into token- and frame-level consistency fields; Temporal Cross-Modal Inconsistency computes a temporal inconsistency score. Adversarial Aware Rewrite Fusion fuses original text with LLM rewrites into a style-robust representation; Hierarchical Multimodal Transformer performs hierarchical multimodal fusion; and the classifier outputs fake probability and uncertainty. The framework is trained via Contrastive Adversarial Joint Learning.
Figure 3: Cross-modal consistency distributions of text--visual, text--audio, visual--audio, and global consistency scores for real vs. fake videos. Real news shows high text--visual but moderate text--audio consistency; fake news flips this pattern, while visual--audio consistency remains high. Left: FakeSV; right: FakeTT.
Figure 4: Consistency--prediction analysis. Global consistency is strongly (negatively) correlated with fake probability; errors concentrate in the middle-consistency band, while $c_{\mathrm{global}}$ vs. entropy is non-linear with uncertainty peaking at mid-range consistency. Left: FakeSV; right: FakeTT.
Figure 5: Uncertainty and two-stage routing behaviour of MAGIC$^3$. Top: entropy/confidence distributions and calibration plots. Bottom: split between direct predictions and routed samples, and accuracies for routed vs. non-routed subsets when using a VLM-based stage-2 detector. Left: FakeSV; right: FakeTT.
...and 2 more figures
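The two-stage routing summarised in the Figure 5 caption can be sketched as an entropy threshold on the first-stage fake probability. This is a simplified illustration: the function names and the 0.8-bit threshold are hypothetical, and the source does not state which uncertainty measure or cut-off MAGIC$^3$ actually uses to decide which samples go to the VLM-based stage-2 detector.

```python
import math


def binary_entropy(p_fake):
    """Shannon entropy, in bits, of a Bernoulli prediction with P(fake) = p_fake."""
    if p_fake <= 0.0 or p_fake >= 1.0:
        return 0.0  # a fully certain prediction carries no uncertainty
    p_real = 1.0 - p_fake
    return -(p_fake * math.log2(p_fake) + p_real * math.log2(p_real))


def route(p_fake, threshold=0.8):
    """Return 'stage2' for uncertain predictions, 'direct' otherwise.

    The threshold (in bits) is a hypothetical value chosen for illustration.
    """
    return "stage2" if binary_entropy(p_fake) > threshold else "direct"


if __name__ == "__main__":
    for p in (0.02, 0.5, 0.95):
        print(p, round(binary_entropy(p), 3), route(p))
```

Only samples whose entropy exceeds the threshold incur the cost of the heavier second stage, which is the mechanism behind the throughput and memory savings the abstract reports for the two-stage system.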
