Language-Agnostic Modeling of Source Reliability on Wikipedia

Jacopo D'Ignazi, Andreas Kaltenbrunner, Yelena Mejova, Michele Tizzani, Kyriaki Kalimeri, Mariano Beiró, Pablo Aragón

TL;DR

This work tackles the challenge of verifying citation credibility across Wikipedia's multilingual editions by modeling source reliability with language-agnostic features derived from edit histories. It builds a large, multi-topic, multi-language dataset and trains an XGBoost classifier using 52 features spanning popularity, permanence, and editor activity, evaluated with leave-one-out and cross-language setups. The results show strong performance in English ($\approx 0.80$ F1 Macro) and reasonable accuracy in mid-resource languages, with permanence features being highly predictive; cross-language adaptation is most effective when leveraging data from multiple languages and applying normalization. The study demonstrates practical value for editors in under-resourced languages and offers insights into editorial behavior across cultures, while outlining limitations and directions for future cross-lingual transfer and time-aware labeling approaches.

Abstract

Over the last few years, verifying the credibility of information sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of web domains as sources in references across multiple language editions of Wikipedia. Utilizing editing activity data, the model evaluates domain reliability within articles of varying controversiality, covering topics such as Climate Change, COVID-19, History, Media, and Biology. Using crafted features that express domain usage across articles, the model effectively predicts domain reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65, while the performance of low-resource languages varies. In all cases, the time the domain remains present in the articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. We believe these findings can assist Wikipedia editors in their ongoing efforts to verify citations and may offer useful insights for other user-generated content communities.
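
The F1 Macro score reported above is the unweighted mean of the per-class F1 scores, so the reliable and unreliable classes count equally even when one of them is rare. The following is a minimal sketch of that standard metric in plain Python, not the authors' evaluation code; the function name `f1_macro` and the toy labels are assumptions made for illustration (class encoding follows Figure 1, 1: reliable and 0: unreliable).

```python
from typing import Sequence


def f1_macro(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives for class c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives for class c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives for class c
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration: 1 = reliable, 0 = unreliable
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = f1_macro(y_true, y_pred)
print(f"F1 macro: {score:.3f}")
```

In this toy case the per-class F1 is 0.75 for the reliable class and 0.5 for the unreliable class, so the macro average is 0.625, lower than the plain accuracy of 4/6 would suggest for the majority class alone.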

Paper Structure

This paper contains 28 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Beeswarm analysis of an XGBoost model trained on Climate Change English dataset (class encoding is 1: reliable and 0: unreliable) and two example classification explanations (score shown is the log-odds ratio).
  • Figure 2: Performance (mean $\pm$ standard deviation) of models trained and validated in languages other than English on the Climate Change datasets. The three batches of languages, separated by a vertical black line, show results on a sample of high (left), mid (center), and low (right) resource languages.
  • Figure 3: Native model performance per topic for different language types. Top panel: % of models that perform better than a random classifier. Bottom panel: Violin plots of corresponding distributions of F1 macro performances.
  • Figure 4: Average model F1 macro (line) and standard deviation (shaded area) versus the size of the training dataset in terms of the number of revisions. In English (red), all topics are combined, and the data is sampled at regular intervals. The same experiment is repeated, considering only revisions before the implementation of the perennial sources on 2018-07-01 (grey). The performance of models in all other languages (also all-topic) is shown in blue, without sub-sampling.
  • Figure 5: Model performance in two adaptation scenarios: cross-language setting (trained on a language, tested on a different language from the same class, in the same topic, in red) and a cross-topic setting (trained on a topic, tested on a different topic, in the same language, in green). Results aggregated by language resourcefulness and compared to same-topic and language native models (in blue).
  • ...and 1 more figure