LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection

Jasraj Singh; Fang Liu; Hong Xu; Bee Chin Ng; Wei Zhang

TL;DR

LingML tackles fake news detection on social media by integrating linguistic knowledge into ML and LLM-based frameworks. It introduces LingM, a linguistic-only ML approach using 18 LIWC features, and LingL, a linguistic-enhanced LLM method that fuses these features with LLM encodings for classification. Empirical results on a COVID-19 fake news dataset show that linguistic features substantially boost performance and interpretability, with LingL reaching a mean error as low as $1.8\%$ and consistently outperforming baseline LLMs (by up to $17.9\%$ in certain configurations). The work underscores the value of domain-informed features for improving generalization and transparency in misinformation detection, and it outlines directions for broader topic coverage and deeper integration.

Abstract

Information now spreads at an unprecedented pace on social media, and discerning truth from misinformation and fake news has become an acute societal challenge. Machine learning (ML) models have been employed to identify fake news, but they remain far from perfect, with limited accuracy, interpretability, and generalizability. In this paper, we enhance ML-based solutions with linguistic input and propose LingML, linguistic-informed ML, for fake news detection. We conducted an experimental study on a popular dataset of fake news during the pandemic. The experimental results show that our proposed solution is highly effective. With only linguistic input used in ML, there are fewer than two errors in every ten attempts, and the knowledge is highly explainable. When linguistic input is integrated with advanced large-scale ML models for natural language processing, our solution outperforms existing ones with a 1.8% average error rate. LingML creates a new path with linguistics to push the frontier of effective and efficient fake news detection. It also sheds light on real-world multi-disciplinary applications that require both ML and domain expertise to achieve optimal performance.

Paper Structure (22 sections, 4 figures, 3 tables, 1 algorithm)

Introduction
System Architecture
ML-based Fake News Detection
Linguistic Knowledge Integration
Methodology
Linguistic Knowledge and Feature Extraction
LingM: Linguistic-Only ML
LingL: Linguistic-Enhanced LLM
Experimental Study
Dataset and Data Pre-processing
Dataset Description
Data Pre-processing
LingM Performance Analysis
Experimental Setup
Fake News Detection Performance
...and 7 more sections

Figures (4)

Figure 1: The system architecture of LingML for fake news detection. As an ML-based system, it takes news as input and outputs true if the news is fake and false otherwise. The key technical challenge in the system is knowledge representation and utilization. LingML extracts linguistic knowledge from the input news and produces ML-digestible linguistic features. The knowledge can be integrated into LingML in different ways and at different stages to achieve optimal detection performance, and the system reacts with risk mitigation measures once fake news is detected.
Figure 2: An illustration of LingL's system architecture with news as input. Linguistic knowledge is extracted from the news and represented as numerical features, which are concatenated with the pooled encoding of the news produced by an LLM. The concatenated vector is passed to a classifier to produce the detection outcome for the news.
Figure 3: A sample news item from the dataset before and after data pre-processing with uncasing, tokenization, and emoji conversion.
Figure 4: The ranking of the features based on their importance (as a percentage) to LingM. The importance percentages sum to 100%. Feature abbreviations are used to save space; the details can be found in Table \ref{tab:method:feature}.
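
The fusion step described in the Figure 2 caption (pooled LLM encoding concatenated with linguistic features, then passed to a classifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mean_pool`, `fuse`, and `classify` are hypothetical, mean pooling and a single logistic layer are stand-ins for whatever pooling and classifier the paper uses, and all numbers are toy values rather than real LLM encodings or LIWC features.

```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-size encoding.

    Assumption: mean pooling is used here only for illustration;
    the paper's pooling strategy may differ.
    """
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def fuse(pooled: Sequence[float], linguistic: Sequence[float]) -> List[float]:
    """Concatenate the pooled encoding with the linguistic feature vector."""
    return list(pooled) + list(linguistic)


def classify(fused: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical logistic classifier: returns a score in (0, 1).

    Assumption: scores above 0.5 are read as "fake", matching the
    convention that the system outputs true for fake news.
    """
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy stand-ins: two tokens with 3-dimensional encodings, two linguistic features.
token_vectors = [[0.2, -0.1, 0.4], [0.0, 0.3, 0.2]]
linguistic = [0.5, 1.0]

pooled = mean_pool(token_vectors)
fused = fuse(pooled, linguistic)

# Untrained (all-zero) parameters, so the score carries no information yet.
weights = [0.0] * len(fused)
bias = 0.0
p_fake = classify(fused, weights, bias)
is_fake = p_fake > 0.5
```

In a trained system, the classifier weights would be learned jointly over both parts of the fused vector, which is what lets the linguistic features influence the decision alongside the LLM encoding.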
