
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia

Natallia Kokash, Giovanni Colavizza

TL;DR

This paper presents an open-source software pipeline for retrieving, classifying, and disambiguating citations on demand from a given Wikipedia dump, and demonstrates its usability.

Abstract

Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module that processes multilingual Wikipedia articles in 15 European languages, parsing their citations and mapping them into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
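To make the idea of template translation concrete, the following is a minimal sketch of mapping a localized citation template onto generic parameter names. It is not the pipeline's actual implementation: the function `translate_template`, the tables `TEMPLATE_NAMES` and `PARAM_NAMES`, and the German example values are all assumptions introduced for illustration.

```python
import re

# Hypothetical mapping tables, for illustration only; the pipeline's real
# translation tables are not given in the text above.
TEMPLATE_NAMES = {"literatur": "citation"}
PARAM_NAMES = {"autor": "author", "titel": "title", "datum": "date"}


def translate_template(wikitext: str) -> dict:
    """Parse one flat {{Name |key=value ...}} template and map it to generic names."""
    match = re.fullmatch(r"\{\{(.*)\}\}", wikitext.strip(), flags=re.DOTALL)
    if match is None:
        raise ValueError("expected a single {{...}} template")
    # First segment is the template name, the rest are its arguments.
    name, *fields = [part.strip() for part in match.group(1).split("|")]
    result = {"template": TEMPLATE_NAMES.get(name.lower(), name.lower())}
    for field in fields:
        key, sep, value = field.partition("=")
        if not sep:
            continue  # positional arguments are skipped in this sketch
        key = key.strip().lower()
        # Unknown parameter names are kept as-is rather than dropped.
        result[PARAM_NAMES.get(key, key)] = value.strip()
    return result


example = "{{Literatur |Autor=Erika Mustermann |Titel=Ein Beispiel |Datum=2020}}"
translated = translate_template(example)
print(translated)
```

Splitting on `|` only works for flat templates; nested templates or wiki links containing pipes would need a proper wikitext parser, which this sketch deliberately leaves out.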

Paper Structure

This paper contains 8 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Wikipedia citation representation [Singh2020].
  • Figure 2: Citation extraction process [Singh2020].
  • Figure 3: Citation translation process.