Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages

Paramita Das, Isaac Johnson, Diego Saez-Trumper, Pablo Aragón

TL;DR

This work tackles cross-language quality assessment for Wikipedia by proposing a language-agnostic framework built on universal structural features derived from Wikitext. It introduces a two-stage pipeline that learns feature weights and applies language-specific normalization against high-quality references, enabling 0–1 quality scoring across 300+ languages. The authors generate and publicly release large-scale datasets: over 2 billion revisions with language-agnostic features and a corresponding set of predicted quality scores, plus mappings to English quality labels for evaluation. Benchmarking against ORES and a Random Forest baseline demonstrates that language-agnostic features offer meaningful predictive value across languages and can complement language-dependent baselines. The work advances knowledge equity by enabling quality analysis in languages with limited or no bespoke assessment schemes and provides resources to support downstream research while adhering to FAIR principles.

Abstract

Wikipedia is the largest web repository of free knowledge. Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions. As content quality varies from article to article, editors also spend substantial time rating articles with specific criteria. However, keeping these assessments complete and up-to-date is largely impossible given the ever-changing nature of Wikipedia. To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles. State-of-the-art approaches to model Wikipedia article quality have leveraged machine learning techniques with language-specific features. In contrast, our framework is based on language-agnostic structural features extracted from the articles, a set of universal weights, and a language version-specific normalization criterion. Therefore, we ensure that all language editions of Wikipedia can benefit from our framework, even those that do not have their own quality assessment scheme. Using this framework, we have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia. We provide a descriptive analysis of these resources and a benchmark of our framework. In addition, we discuss possible downstream tasks to be addressed with these datasets, which are released for public use.
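The scoring idea described above (universal weights combined with a per-language normalization) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three feature names, the weight values, the use of a 95th-percentile reference value per language edition, and the helper names `reference_values` and `quality_score`. The first stage, learning the weights, is not shown; the weights are simply fixed placeholders.

```python
import math

# Hypothetical universal weights shared by all language editions.
# They sum to 1 so the final score stays in [0, 1]. Not the paper's values.
WEIGHTS = {"page_length": 0.4, "references": 0.3, "sections": 0.3}


def reference_values(articles, pct=95):
    """Per-feature reference value for one language edition.

    Assumption: the reference is the nearest-rank pct-th percentile of each
    feature across the edition's articles, standing in for what a
    high-quality article looks like in that language.
    """
    refs = {}
    for name in WEIGHTS:
        values = sorted(article[name] for article in articles)
        rank = max(math.ceil(pct / 100 * len(values)), 1)  # nearest-rank method
        ref = values[rank - 1]
        refs[name] = ref if ref > 0 else 1.0  # avoid division by zero
    return refs


def quality_score(article, refs):
    """Weighted sum of features, each normalized by its reference and capped at 1."""
    return sum(
        weight * min(article[name] / refs[name], 1.0)
        for name, weight in WEIGHTS.items()
    )


# Toy edition with three articles (made-up feature values).
edition = [
    {"page_length": 1000, "references": 2, "sections": 1},
    {"page_length": 5000, "references": 10, "sections": 4},
    {"page_length": 20000, "references": 40, "sections": 8},
]
refs = reference_values(edition)
scores = [quality_score(article, refs) for article in edition]
```

Because each feature is divided by a reference computed within the same language edition, the same weights can be reused across editions whose articles differ systematically in length or citation habits, which is the motivation for separating universal weights from language-specific normalization.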

Paper Structure (17 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: English Wikipedia article about Catalan Rumba.
  • Figure 2: Box plots of the feature distributions for the top 9 Wikipedia language versions by editing activity: English (en), German (de), French (fr), Spanish (es), Italian (it), Russian (ru), Japanese (ja), Chinese (zh) and Vietnamese (vi). Each box plot represents the distribution of feature values of the latest revision of each article in a given language version.
  • Figure 3: Box plots of predicted article quality over time for the top 9 Wikipedia language editions by editing activity. Each box represents the predicted quality scores of the latest revision, up to a given year, of each article in a given language edition of Wikipedia. Color darkness encodes time: the darker the box, the more recent the year.
  • Figure 4: Confusion matrix of class distributions of the articles as predicted by our model for (a) English Wikipedia and (b) French Wikipedia.