
Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, Jeanna Matthews

TL;DR

This work tackles the problem that template-based translation inflates the Egyptian Arabic Wikipedia with low-quality, culturally unrepresentative content, potentially biasing NLP models. The authors perform exploratory analysis across AR, ARZ, and ARY to characterize content density, quality, and human involvement, and build multivariate classifiers that fuse article metadata with embeddings to detect template-translated articles in ARZ. They find that ensemble methods, particularly XGBoost, achieve top performance, and they publicly deploy an online detector, the Egyptian Wikipedia Scanner, while releasing labeled datasets and code to support reproducibility. The study highlights societal and representation concerns and demonstrates how metadata-driven signals can reliably identify template-driven content, providing practical tools for data curation and supporting more trustworthy, culturally faithful NLP resources for Arabic.

Abstract

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially for low-resource languages other than English. Yet only a few studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), and documented issues in the Egyptian Arabic Wikipedia edition: the massive automatic creation of its articles using template-based translation from English to Arabic without human involvement has overwhelmed the edition with articles that not only have low-quality content but also do not represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate the problem of template translation in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics through exploratory analysis and by building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage articles' metadata to detect template-translated articles automatically. We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called EGYPTIAN WIKIPEDIA SCANNER, and release the extracted, filtered, and labeled datasets so that the research community can benefit from both the datasets and the web-based detection system.
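
The classifiers described above take article metadata, embeddings, or both as input. The following is a minimal sketch of how those three input variants could be assembled into a feature vector; it is not the authors' code. The function `build_features` and the five metadata field names are hypothetical, since the summary states only that five metadata features and embeddings of size 300 or 768 are used.

```python
from typing import Dict, List

# Hypothetical field names: the source says five article metadata
# features are used but does not list them here.
METADATA_FIELDS = [
    "total_edits",
    "total_editors",
    "total_bytes",
    "total_chars",
    "total_words",
]


def build_features(
    embedding: List[float],
    metadata: Dict[str, float],
    mode: str = "both",
) -> List[float]:
    """Return the input vector for one article.

    mode is "embeddings", "metadata", or "both" (simple concatenation).
    """
    # Read the metadata in a fixed order so every article's vector lines up.
    meta_vec = [float(metadata[name]) for name in METADATA_FIELDS]
    if mode == "embeddings":
        return list(embedding)
    if mode == "metadata":
        return meta_vec
    if mode == "both":
        return list(embedding) + meta_vec
    raise ValueError(f"unknown mode: {mode!r}")
```

With a 300-dimensional embedding, the "both" variant yields a 305-dimensional vector that any tabular classifier could consume.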

Paper Structure (33 sections, 9 figures, 11 tables)

This paper contains 33 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Visualizations of tokens and characters per article for each Arabic Wikipedia edition, with total tokens and characters on the y-axes, articles on the x-axes, and the mean lines plotted.
  • Figure 2: Counts of the top common/duplicate n-grams in each Arabic Wikipedia edition; log values/counts are shown only for the top K=1 common/duplicate n-grams.
  • Figure 3: Visualizations displaying the percentage of article creators and editors by type (bots and humans) and their number of contributions (article creations) in each Arabic Wikipedia edition.
  • Figure 4: A basic process chart demonstrating the studied input features: embeddings (two word embeddings of size 300 or 768), metadata (five article metadata features), or both (embeddings + metadata).
  • Figure 5: A treemap showing the effectiveness of the heuristic rules applied to articles predating the template-based translation in Egyptian Wikipedia, highlighting the number of articles filtered out by each rule.
  • ...and 4 more figures
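
The density and duplication measures behind Figures 1 and 2 can be illustrated with a short sketch. It assumes whitespace tokenization and uses hypothetical helper names (`density_stats`, `top_ngrams`); the paper's actual preprocessing may differ.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple


def density_stats(articles: List[str]) -> Dict[str, float]:
    """Mean tokens and characters per article (whitespace tokens assumed)."""
    token_counts = [len(text.split()) for text in articles]
    char_counts = [len(text) for text in articles]
    return {"mean_tokens": mean(token_counts), "mean_chars": mean(char_counts)}


def top_ngrams(
    articles: List[str], n: int = 3, k: int = 1
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams across all articles with counts."""
    counts: Counter = Counter()
    for text in articles:
        tokens = text.split()
        # Slide a window of n tokens over the article.
        counts.update(
            tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)
```

A very high count for the top n-gram relative to the number of articles is the kind of signal that suggests many articles share one template.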