Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT

Ahmed Akib Jawad Karim, Kazi Hafiz Md Asad, Aznur Azam

TL;DR

The paper tackles fake news detection by systematically comparing SVM classifiers using BoW, TF-IDF, and Word2Vec vectorizations against a BERT-base transformer on the ISOT Fake News Dataset. It shows that SVM with BoW and TF-IDF comes close to the top result (99.81% accuracy and an F1-score of 0.9980 for the linear-kernel BoW model) while requiring substantially fewer computational resources than BERT, which reaches 99.98% accuracy and an F1-score of 0.9998 but demands GPU-based training. The study also investigates RBF kernels, which yield marginal gains over linear kernels for some vectorizations. Overall, the work highlights efficient, scalable alternatives to heavyweight transformers and suggests hybrid approaches that blend contextual embeddings with lightweight classifiers for robust fake news detection.

Abstract

The rapid spread of misinformation, particularly through online platforms, underscores the urgent need for reliable detection systems. This study explores the use of machine learning and natural language processing, specifically Support Vector Machines (SVM) and BERT, to detect fake news. We employ three text vectorization methods for SVM: Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Bag of Words (BoW), and evaluate how well each distinguishes genuine from fake news. We also compare these methods against the transformer language model BERT. Our approach includes detailed preprocessing steps, model implementation, and evaluation to determine the most effective techniques. The results show that BERT achieves the highest accuracy, 99.98%, with an F1-score of 0.9998, while the SVM model with a linear kernel and BoW vectorization also performs exceptionally well, achieving 99.81% accuracy and an F1-score of 0.9980. These findings highlight that, despite BERT's superior performance, SVM models with BoW and TF-IDF vectorization come remarkably close, offering highly competitive performance with lower computational requirements.
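
To make the two sparse vectorizations named in the abstract concrete, here is a minimal pure-Python sketch of BoW and TF-IDF. It is an illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the helper names (`tokenize`, `build_vocab`, `bow_vector`, `tfidf_vector`), and the two toy documents are all hypothetical, and the textbook weighting tf × log(N / df) is used here, whereas library implementations often apply smoothed variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return text.lower().split()


def build_vocab(docs):
    # Map each distinct token to a column index, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(doc, vocab):
    # Bag of Words: raw count of each vocabulary token in the document.
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def tfidf_vector(doc, docs, vocab):
    # Textbook TF-IDF: (count / doc length) * log(N / document frequency).
    n_docs = len(docs)
    doc_tokens = tokenize(doc)
    counts = Counter(doc_tokens)
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        if tok not in vocab:
            continue
        df = sum(1 for d in docs if tok in tokenize(d))
        vec[vocab[tok]] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return vec


# Hypothetical toy corpus, for illustration only.
docs = ["breaking shocking news", "official news report"]
vocab = build_vocab(docs)
bow = bow_vector(docs[0], vocab)
tfidf = tfidf_vector(docs[0], docs, vocab)
```

In this toy corpus the token "news" appears in both documents, so its TF-IDF weight is zero, while BoW still counts it once. Either vector could then be passed to a linear classifier such as an SVM.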

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: 3D t-SNE of the Dataset
  • Figure 2: Workflow Diagram for Fake News Detection
  • Figure 3: Confusion Matrix (Linear Kernel)
  • Figure 4: Support Vectors (Linear Kernel)
  • Figure 5: Receiver Operating Characteristic (Linear Kernel)
  • ...and 1 more figure