Estimating Gender Completeness in Wikipedia

Hrishikesh Patel, Tianwa Chen, Ivano Bongiovanni, Gianluca Demartini

TL;DR

The paper addresses gender completeness in Wikipedia by estimating gender-based cardinalities for sub-classes of Person using capture/recapture estimators applied to Wikipedia edit histories, complemented by a name-based gender classifier. It builds a dataset from DBpedia and Wikipedia edits (121,535 entities across 34 classes; 4,896,299 edits) and uses the estimators J1 and N1_UNIF, with a 7-day window identified as optimal. The results reveal varying completeness across genders and sub-classes, for example female engineers at about 78% complete and male engineers around 90%, highlighting persistent coverage gaps. The approach provides editors with quantitative tools to inform editorial priorities and track progress toward more balanced representation.
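
To illustrate the capture/recapture idea, the sketch below applies the standard first-order jackknife formula, estimate = observed + f1 * (k - 1) / k, to an edit history. It rests on assumptions that are not taken from the paper: each 7-day window is treated as one capture occasion, f1 counts entities seen in exactly one window, k is the number of windows spanning the history, and completeness is the observed count divided by the estimate. The function names and the `(entity_id, timestamp)` input shape are hypothetical, and the paper's exact J1 and N1_UNIF definitions may differ.

```python
from collections import Counter
from datetime import datetime, timedelta


def jackknife1_estimate(edits, window_days=7):
    """Return (observed, estimated) entity counts for one class and gender.

    edits: iterable of (entity_id, timestamp) pairs (hypothetical input shape).
    """
    edits = list(edits)
    if not edits:
        return 0, 0.0
    start = min(ts for _, ts in edits)
    window = timedelta(days=window_days)
    # Collapse edits to distinct (entity, window index) captures.
    captures = {(entity, (ts - start) // window) for entity, ts in edits}
    # Assumption: every window spanning the history is a capture occasion.
    k = max(w for _, w in captures) + 1
    windows_per_entity = Counter(entity for entity, _ in captures)
    observed = len(windows_per_entity)
    # f1: entities captured in exactly one window.
    f1 = sum(1 for n in windows_per_entity.values() if n == 1)
    estimated = observed + f1 * (k - 1) / k
    return observed, estimated


def completeness(observed, estimated):
    """Share of the estimated class size that is already observed."""
    return observed / estimated if estimated > 0 else float("nan")


# Toy edit history: A is edited in both windows, B and C in one window each.
day0 = datetime(2020, 1, 1)
edits = [
    ("A", day0),
    ("A", day0 + timedelta(days=8)),
    ("B", day0 + timedelta(days=1)),
    ("C", day0 + timedelta(days=9)),
]
observed, estimated = jackknife1_estimate(edits)
print(observed, estimated, completeness(observed, estimated))  # 3 4.0 0.75
```

In this toy case two of the three observed entities appear in only one window, so the estimator infers roughly one unseen entity and reports 75% completeness. Running the same computation separately on the edits of female-classified and male-classified entities of a class would give per-gender figures of the kind quoted above.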

Abstract

Gender imbalance in Wikipedia content is a known challenge which the editor community is actively addressing. The aim of this paper is to provide the Wikipedia community with instruments to estimate the magnitude of the problem for different entity types (also known as classes) in Wikipedia. To this end, we apply class completeness estimation methods based on the gender attribute. Our results show not only which gender is more prevalent in Wikipedia for different sub-classes of Person, but also give an idea of how complete the coverage is for different genders and sub-classes of Person.

Paper Structure (10 sections, 4 figures)

Figures (4)

  • Figure 1: Name-based gender classification of Wikipedia person names per sub-class of Person.
  • Figure 2: Class cardinality estimation for some example classes (classes estimated to be most incomplete in the top row, and classes estimated to be most complete in the bottom row) with different statistical estimators.
  • Figure 3: The impact of window size over the edit history for capture/recapture-based estimation methods.
  • Figure 4: Gender-based entity count in Wikipedia, cardinality estimation (Est.), convergence score (Conve.), and estimated class completeness (Compl.) for estimators N1_UNIF (N1) and Jack1 (J1).