Identification and explanation of disinformation in wiki data streams
Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan C Burguillo
TL;DR
The paper addresses the challenge of identifying disinformation in wiki data streams and the need for real-time, interpretable validation of crowd-generated content. It proposes a streaming framework that combines stream-based data processing, extensive feature engineering (including historical and side features), and online classifiers (Gaussian Naive Bayes, ALMA, HATC, ARFC) with a real-time explainability dashboard. A key novelty is the integration of LLM-powered natural-language explanations and an expert-in-the-loop step to improve trust and accountability. Experiments on Wikivoyage and Wikipedia show robust performance (around 90% across evaluation metrics) with real-time processing, suggesting practical utility for editors and content moderators.
Abstract
Social media platforms, increasingly used as news sources for varied data analytics, have transformed how information is generated and disseminated. However, the unverified nature of this content raises concerns about its trustworthiness and accuracy, since disinformation can impair readers' critical judgment. This work contributes to the field of automatic data quality validation, addressing the rapid growth of online content on wiki pages. Our scalable solution comprises stream-based data processing with feature engineering, feature analysis and selection, stream-based classification, and real-time explanation of prediction outcomes. The explainability dashboard is designed for the general public, who may lack the specialized knowledge needed to interpret the model's predictions. Experimental results on two datasets reach approximately 90% on all evaluation metrics, demonstrating robust performance that is competitive with related works in the literature. In summary, the system assists editors by reducing the effort and time they spend detecting disinformation.
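The stream-based classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a test-then-train (prequential) loop, a hand-written incremental Gaussian Naive Bayes, and a synthetic stream whose feature names (`size_delta`, `prior_reverts`) are hypothetical stand-ins for engineered wiki-revision features.

```python
import math
from collections import defaultdict


class OnlineGaussianNB:
    """Incremental Gaussian Naive Bayes (illustrative, not the paper's code)."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # stats[label][feature] = [count, running mean, sum of squared deviations]
        self.stats = defaultdict(dict)

    def learn_one(self, x, y):
        """Update per-class running mean and variance (Welford's method)."""
        self.class_counts[y] += 1
        for name, value in x.items():
            s = self.stats[y].setdefault(name, [0, 0.0, 0.0])
            s[0] += 1
            delta = value - s[1]
            s[1] += delta / s[0]
            s[2] += delta * (value - s[1])

    def predict_one(self, x):
        """Return the label with the highest log-posterior, or None if untrained."""
        if not self.class_counts:
            return None
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in x.items():
                s = self.stats[label].get(name)
                if s is None:
                    continue
                var = s[2] / s[0] + 1e-6  # smoothed population variance
                score -= 0.5 * math.log(2 * math.pi * var)
                score -= (value - s[1]) ** 2 / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def prequential_accuracy(stream, model):
    """Test-then-train: predict each sample first, then learn from it."""
    correct = seen = 0
    for x, y in stream:
        y_pred = model.predict_one(x)
        if y_pred is not None:
            seen += 1
            correct += int(y_pred == y)
        model.learn_one(x, y)
    return correct / seen if seen else 0.0


def toy_stream(n=40):
    """Synthetic revisions: label 1 marks a revision that should be flagged."""
    for i in range(n):
        if i % 2 == 0:
            yield {"size_delta": 5.0 + i % 3, "prior_reverts": 0.0}, 0
        else:
            yield {"size_delta": 50.0 + i % 3, "prior_reverts": 5.0}, 1


if __name__ == "__main__":
    acc = prequential_accuracy(toy_stream(), OnlineGaussianNB())
    print(f"prequential accuracy: {acc:.3f}")
```

Each sample is scored before the model sees its label, so the accuracy reflects performance on unseen data at every point in the stream, which is the usual way to evaluate online classifiers. A production system would more likely rely on an online-learning library than on a hand-written model.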
