Capturing Pertinent Symbolic Features for Enhanced Content-Based Misinformation Detection
Flavio Merenda, José Manuel Gómez-Pérez
TL;DR
The paper tackles content-based misinformation detection under heterogeneous linguistic and domain conditions. It proposes a hybrid approach that fuses symbolic linguistic features with RoBERTa-based adapters to enhance robustness and efficiency without requiring additional training data. Through dataset characterization and feature-selection analyses, it demonstrates the predictive value of symbolic features and their complementary role alongside neural representations. The proposed AdapterF model achieves state-of-the-art results across multiple datasets and shows strong generalization under domain shifts, highlighting the practical value of integrating structured knowledge into language models.
Abstract
Preventing the spread of misinformation is challenging. Detecting misleading content is a significant hurdle because of its extreme linguistic and domain variability. Content-based models have managed to identify deceptive language by learning representations from textual data such as social media posts and web articles. However, aggregating representative samples of this heterogeneous phenomenon and implementing effective real-world applications remain elusive. Building on analytical work on the language of misinformation, this paper analyzes the linguistic attributes that characterize this phenomenon and how well some of the most popular misinformation datasets represent those attributes. We demonstrate that the appropriate use of pertinent symbolic knowledge in combination with neural language models helps detect misleading content. Our approach achieves state-of-the-art performance across misinformation datasets, offering a valid and robust alternative to multi-task transfer learning without requiring any additional training data. Furthermore, our results provide evidence that structured knowledge can supply the extra boost required to address a complex and unpredictable real-world problem like misinformation detection, not only in terms of accuracy but also in time efficiency and resource utilization.
