Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing

Isaac Johnson; Lucie-Aimée Kaffee; Miriam Redi

TL;DR

The paper examines how Wikimedia data can be better aligned with the NLP needs of Wikimedia editors, surveying its use across pre-training, post-training, and evaluation. It assesses dataset usefulness against principles such as multilinguality, adherence to core Wikimedia policies, and openness, and provides a case-study workflow from raw dumps to benchmarks. The review catalogs existing pre-training sources (e.g., Wikipedia, Wikisource, Wikidata-related content) and post-training tasks (classification, recommendation, text generation), highlighting gaps such as limited multimodal data and English-dominant benchmarks. The authors call for more multilingual, open, and compact models and for editor-centered benchmarks (e.g., FreshWiki) to better serve Wikimedia editors and ensure responsible AI use.

Abstract

Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated for use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.
