Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

Chong Tian, Yu Wang, Chenxu Yang, Junyi Guan, Zheng Lin, Yuhan Liu, Xiuying Chen, Qirong Ho

Abstract

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation, in which each modality appears plausible on its own yet the cross-modal relationships are subtly inconsistent, such as visuals that do not match the caption. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes consistency signals across the three modalities at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier that selectively routes uncertain samples to a VLM. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. The two-stage system matches VLM-level accuracy while achieving 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.

Paper Structure (47 sections, 32 equations, 7 figures, 11 tables)

This paper contains 47 sections, 32 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Illustration of cross-modal consistency patterns. In real news short videos, text, visuals, and audio are contextually aligned (Consistent). In fake news, a "semantic gap" often exists between the sensational claims (text/audio) and the actual visual content. MAGIC$^3$ acts as a consistency lens to quantify these multimodal relationships for fake news detection.
  • Figure 2: MAGIC$^3$ Overview. Frozen encoders provide text, visual, audio, and rewrite features. The Cross-Modal Consistency Gate outputs pairwise and global consistency scores; the Consistency Field Estimator converts cross-modal attention into token- and frame-level consistency fields; and Temporal Cross-Modal Inconsistency computes a temporal inconsistency score. Adversarial Aware Rewrite Fusion fuses the original text with LLM rewrites into a style-robust representation; the Hierarchical Multimodal Transformer performs hierarchical multimodal fusion; and the classifier outputs a fake probability and an uncertainty estimate. The framework is trained via Contrastive Adversarial Joint Learning.
  • Figure 3: Cross-modal consistency distributions of text--visual, text--audio, visual--audio, and global consistency scores for real vs. fake videos. Real news shows high text--visual but moderate text--audio consistency; fake news flips this pattern, while visual--audio consistency remains high. Left: FakeSV; right: FakeTT.
  • Figure 4: Consistency--prediction analysis. Global consistency is strongly negatively correlated with fake probability, and errors concentrate in the middle-consistency band. The relationship between $c_{\mathrm{global}}$ and entropy is non-linear, with uncertainty peaking at mid-range consistency. Left: FakeSV; right: FakeTT.
  • Figure 5: Uncertainty and two-stage routing behaviour of MAGIC$^3$. Top: entropy/confidence distributions and calibration plots. Bottom: split between direct predictions and routed samples, and accuracies for routed vs. non-routed subsets when using a VLM-based stage-2 detector. Left: FakeSV; right: FakeTT.
  • ...and 2 more figures
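
The captions for Figures 2 and 5 mention pairwise and global consistency scores and entropy-based routing to a stage-2 VLM. The following is a minimal illustrative sketch of those two ideas, not the authors' implementation. It assumes cosine similarity as a stand-in for the learned Cross-Modal Consistency Gate, an unweighted mean as the global score, and a hypothetical entropy threshold of 0.9 bits; none of these choices are stated in the text above, and all function names are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def consistency_scores(text, visual, audio):
    """Pairwise and global consistency for one video.

    Assumption: cosine similarity replaces the paper's learned gate, and the
    global score is the plain mean of the three pairwise scores.
    """
    c_tv = cosine(text, visual)
    c_ta = cosine(text, audio)
    c_va = cosine(visual, audio)
    return {
        "text_visual": c_tv,
        "text_audio": c_ta,
        "visual_audio": c_va,
        "global": (c_tv + c_ta + c_va) / 3.0,
    }

def binary_entropy(p_fake):
    """Entropy in bits of a Bernoulli prediction with P(fake) = p_fake."""
    if p_fake <= 0.0 or p_fake >= 1.0:
        return 0.0
    return -(p_fake * math.log2(p_fake) + (1.0 - p_fake) * math.log2(1.0 - p_fake))

def route(p_fake, entropy_threshold=0.9):
    """Send high-entropy predictions to the stage-2 VLM, keep the rest.

    Assumption: the 0.9-bit threshold is a hypothetical value for illustration.
    """
    return "vlm" if binary_entropy(p_fake) > entropy_threshold else "direct"

if __name__ == "__main__":
    # Toy 2-d features: text and visual agree, audio is orthogonal to both.
    scores = consistency_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
    print(scores)
    print(route(0.5), route(0.99))
```

In this sketch a confident prediction is returned directly, while a prediction near 0.5 is routed to the second stage, mirroring the split between direct predictions and routed samples described for Figure 5.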