LingML: Linguistic-Informed Machine Learning for Enhanced Fake News Detection
Jasraj Singh, Fang Liu, Hong Xu, Bee Chin Ng, Wei Zhang
TL;DR
LingML tackles fake news detection on social media by integrating linguistic knowledge into ML- and LLM-based frameworks. It introduces LingM, a linguistic-only ML approach using 18 LIWC features, and LingL, a linguistic-enhanced LLM method that fuses these features with LLM encodings for classification. Empirical results on a COVID-19 fake news dataset show that linguistic features substantially boost performance and interpretability: LingL achieves a mean error as low as $1.8\%$ and consistently outperforms baseline LLMs (by up to $17.9\%$ in certain configurations). The work underscores the value of domain-informed features for improving generalization and transparency in misinformation detection, and outlines directions for broader topic coverage and deeper integration.
Abstract
Information now spreads at an unprecedented pace on social media, and discerning truth from misinformation and fake news has become an acute societal challenge. Machine learning (ML) models have been employed to identify fake news, but they remain far from perfect, with limited accuracy, interpretability, and generalizability. In this paper, we enhance ML-based solutions with linguistic input and propose LingML, linguistic-informed ML, for fake news detection. We conducted an experimental study on a popular dataset of fake news from the COVID-19 pandemic. The experimental results show that our proposed solution is highly effective. With only linguistic input used in ML, the model makes fewer than two errors in every ten attempts, and the learned knowledge is highly explainable. When linguistic input is integrated with advanced large-scale ML models for natural language processing, our solution outperforms existing ones, with an average error rate of 1.8%. LingML opens a new path that uses linguistics to push the frontier of effective and efficient fake news detection. It also sheds light on real-world multidisciplinary applications that require both ML and domain expertise to achieve optimal performance.
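To make the integration of linguistic input with a large language model's text encoding concrete, the following is a minimal sketch of one simple fusion strategy: concatenating a standardized linguistic feature vector with the encoding and passing the result to a logistic classification head. It is an illustration under stated assumptions, not the authors' architecture. The function names (`standardize`, `fuse`, `predict_proba`), the 768-dimensional encoding size, and the random placeholder data are all hypothetical; only the count of 18 linguistic features comes from the summary above.

```python
import numpy as np

N_LING = 18   # number of LIWC features named in the summary
ENC_DIM = 768  # assumed encoder width (hypothetical, not from the paper)

def standardize(x, eps=1e-8):
    """Z-score each feature column so scales are comparable."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def fuse(text_enc, ling_feats):
    """Concatenate text encodings with standardized linguistic features."""
    return np.concatenate([text_enc, standardize(ling_feats)], axis=1)

def predict_proba(fused, w, b):
    """Logistic head: probability that each row is fake news."""
    return 1.0 / (1.0 + np.exp(-(fused @ w + b)))

# Random placeholders stand in for real encodings and LIWC scores.
rng = np.random.default_rng(0)
text_enc = rng.normal(size=(4, ENC_DIM))
ling = rng.normal(size=(4, N_LING))

fused = fuse(text_enc, ling)
w = rng.normal(scale=0.01, size=fused.shape[1])  # untrained weights
p = predict_proba(fused, w, 0.0)
```

In this sketch the weights are random, so the probabilities carry no meaning; in practice the head would be trained on labeled posts. Concatenation is only one possible choice, and the paper may combine the two inputs differently.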
