Characterizing Knowledge Manipulation in a Russian Wikipedia Fork

Mykola Trokhymovych, Oleksandr Kosovan, Nathan Forrester, Pablo Aragón, Diego Saez-Trumper, Ricardo Baeza-Yates

TL;DR

This study analyzes a Russian Wikipedia fork (RWFork) to characterize how original Russian Wikipedia content may be manipulated to align with national regulations. It introduces a data-intensive methodology that compares 1.9 million article pairs, extracts text changes, categories, sources, and named entities, and uses NLP-driven clustering with GPT-4o-mini and embeddings to derive an eight-cluster taxonomy of manipulation patterns. The findings show that a small subset of highly viewed articles undergoes changes concentrated on topics related to the Ukraine conflict, with systematic edits altering terminology, territorial designations, and sources, while a substantial portion of edits are non-textual metadata adjustments. The work highlights implications for knowledge integrity and for the training data used for large language models, and it provides a replicable framework and open data for studying other forks and similar platforms.

Abstract

Wikipedia is powered by MediaWiki, free and open-source software that also serves as the infrastructure for many other wiki-based online encyclopedias. These include the recently launched website Ruwiki, which has copied and modified the original Russian Wikipedia content to conform to Russian law. To identify practices and narratives that could be associated with different forms of knowledge manipulation, this article presents an in-depth analysis of this Russian Wikipedia fork. We propose a methodology to characterize the main changes with respect to the original version. The foundation of this study is a comprehensive comparative analysis of more than 1.9M articles from Russian Wikipedia and its fork. Using meta-information and geographical, temporal, categorical, and textual features, we explore the changes made by Ruwiki editors. Furthermore, we present a classification of the main topics of knowledge manipulation in this fork, including a numerical estimation of their scope. This research not only sheds light on significant changes within Ruwiki, but also provides a methodology that could be applied to analyze other Wikipedia forks and similar collaborative projects.
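
To make the pairwise comparison concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each article is available as plain text and uses Python's standard `difflib` for a word-level diff; the function names `classify_pair` and `word_changes` are hypothetical, and the labels mirror the changed, duplicated, and missing page groups used in the paper's figures.

```python
import difflib
from typing import List, Optional, Tuple


def classify_pair(original: str, fork: Optional[str]) -> str:
    """Label an article pair (hypothetical helper, not from the paper).

    'missing'    : the fork has no counterpart for the original page
    'duplicated' : the fork text is identical to the original
    'changed'    : the fork text differs from the original
    """
    if fork is None:
        return "missing"
    return "duplicated" if original == fork else "changed"


def word_changes(original: str, fork: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original_span, fork_span) for each word-level change.

    Assumption: whitespace tokenization is good enough for illustration;
    a real pipeline would likely work on parsed wikitext instead.
    """
    a, b = original.split(), fork.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / insert / delete spans
    ]


if __name__ == "__main__":
    # Toy texts, invented purely for illustration.
    src = "the river flows north past town"
    frk = "the river flows south past town"
    print(classify_pair(src, frk))
    print(word_changes(src, frk))
```

Extracted spans of this kind could then be embedded and clustered to group similar changes, which is the role the summary above attributes to the NLP-driven clustering step.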

Paper Structure

This paper contains 30 sections, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Summary of our research that analyzes changes in a Russian Wikipedia fork to assess knowledge manipulation.
  • Figure 2: Process for crawling revision differences between RWFork and Russian Wikipedia.
  • Figure 3: Comparison of Russian Wikipedia page statistics for the groups of changed, duplicated, and missing pages. Statistics used: (a) Monthly page views; (b) Edit count; (c) IP edit rate; and (d) Revert rate. Plots include mean values with 95% confidence intervals for the corresponding statistics.
  • Figure 4: Average number of edits per day of week and hour of day in RWFork (top/blue) and Russian Wikipedia (bottom/red). The color intensity indicates the volume of edits, with darker shades representing higher activity.
  • Figure 5: Rates within the groups of changed, duplicated, and missing pages in RWFork for the 10 most frequent countries in the changed group.
  • ...and 4 more figures