Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

S Mahmudul Hasan, Shaily Roy, Akib Jawad Nafis

TL;DR

Text-only political fake-news detection is limited by the weak discriminative signal in short political statements, as shown on the LIAR dataset. The authors run a broad diagnostic benchmark across nine algorithms using lexical (BoW/TF-IDF) and semantic (GloVe) representations. It reveals a consistent performance ceiling (~0.32 Weighted F1 for fine-grained classification, ~0.64 for binary) and a large generalization gap, with high-capacity models memorizing the training data. SMOTE augmentation fails to improve results, indicating that the bottleneck is semantic ambiguity rather than distributional imbalance. The study concludes that increasing model complexity yields limited gains without external knowledge, and suggests that future work integrate external evidence, knowledge sources, or multi-modal signals for robust political fact-checking.

Abstract

The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
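The two quantities the abstract relies on, the Weighted F1-score and the gap between training and test accuracy, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' evaluation code: the function names and the toy labels are assumptions made for this example, and the paper's own pipeline is not described here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def generalization_gap(train_true, train_pred, test_true, test_pred):
    """Training accuracy minus test accuracy; large values suggest memorization."""
    return accuracy(train_true, train_pred) - accuracy(test_true, test_pred)


# Toy labels (hypothetical, not taken from LIAR).
train_true = ["true", "false", "false", "half-true"]
train_pred = ["true", "false", "false", "half-true"]  # model fits training data perfectly
test_true = ["true", "false", "false", "half-true"]
test_pred = ["true", "false", "true", "false"]        # but errs on unseen data

gap = generalization_gap(train_true, train_pred, test_true, test_pred)
f1 = weighted_f1(test_true, test_pred)
```

Applied to the figures quoted in the abstract, a tree-based ensemble with more than 99% training accuracy and approximately 25% test accuracy has a gap of roughly 0.74 under this definition.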

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Schematic of the experimental framework comparing lexical (TF-IDF) and semantic (GloVe) feature representations across linear and ensemble models to quantify the generalization gap.
  • Figure 2: Label Distribution across Training, Validation, and Testing splits. The varying frequencies necessitated the use of SMOTE for class balancing.
  • Figure 3: Benchmarking Results. Left: Our Extra Trees model using GloVe (300d) features achieves an accuracy of 0.262, rivaling the Hybrid CNN baseline (0.274). Right: Comparisons against Khan et al. reveal that our SVM model using Bag-of-Words features (0.624) outperforms the Traditional Naive Bayes baseline (0.60) and matches the pre-trained RoBERTa model (0.62).
  • Figure 4: Performance Leaderboard with Feature Annotations. The text inside each bar indicates the optimal feature set for that model. Top: Multi-class performance plateaus at 0.32 F1. Bottom: Binary performance jumps to 0.63 F1, but the ceiling effect remains.
  • Figure 5: Raw vs. SMOTE Performance. The side-by-side comparison shows that synthetic oversampling (Red) offers no significant advantage over the raw baseline (Blue), confirming that the limitation is semantic (feature ambiguity) rather than distributional (data quantity).
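The captions for Figures 2 and 5 refer to SMOTE oversampling. As a rough illustration of the core idea, the sketch below generates synthetic minority-class points by interpolating between an existing point and one of its nearest minority-class neighbours. It is a simplified stand-in written for this summary: the function name `smote_sample`, its parameters, and the toy vectors are hypothetical, and the authors' actual SMOTE implementation and settings are not stated here.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Assumes at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy 2-D feature vectors for an under-represented class (hypothetical data).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sample(minority_points, k=2, n_new=5, seed=42)
```

Because every synthetic point is a convex combination of two existing points, it stays inside the region the minority class already occupies in feature space. That is consistent with the Figure 5 caption: if the original features do not separate the classes, interpolated copies of them are unlikely to either.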