AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus

Sultan Alrashed, Francesco Orabona

TL;DR

AraMix addresses redundancy in Arabic pretraining data by aggregating seven public Arabic corpora, applying Arabic-specific quality filters, and performing cross-dataset deduplication. The approach shows that nearly 60% of tokens are duplicated across sources and yields a final corpus of approximately 178B tokens across about 167-179M documents after processing. The work argues that, for low-resource languages, curating existing data and incorporating novel data sources delivers greater returns than additional web crawling. It delivers the largest heavily filtered Arabic pretraining corpus to date and highlights the importance of language-specific data processing for non-English corpora.

Abstract

We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, re-filter some of them with quality filters designed specifically for Arabic text, and perform cross-dataset deduplication at both the document level (MinHash) and the sentence level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, a redundancy that any new scraping effort will reproduce. Our work suggests that for lower-resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
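
To make the two deduplication stages concrete, the following is a minimal sketch of MinHash near-duplicate removal followed by exact sentence-level deduplication. It is not the authors' pipeline: the function names, the word 3-gram shingles, the 64 hash functions, the 0.8 similarity threshold, and the naive split on "." are all illustrative assumptions, and the quadratic signature comparison stands in for the LSH banding that a large-scale pipeline would typically use.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (n=3 is an assumed setting)."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum seeded hash per hash function."""
    sh = shingles(text)
    if not sh:
        return tuple([0] * num_perm)
    return tuple(min(_seeded_hash(s, seed) for s in sh) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup_minhash(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept.

    Compares against every kept signature (quadratic); illustrative only.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


def dedup_sentences(docs):
    """Drop sentences already seen in any earlier document (exact match)."""
    seen, out = set(), []
    for doc in docs:
        fresh = []
        for sentence in doc.split("."):  # naive splitter, an assumption
            s = sentence.strip()
            if not s:
                continue
            key = hashlib.sha1(s.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                fresh.append(s)
        if fresh:
            out.append(". ".join(fresh) + ".")
    return out
```

Running `dedup_minhash` first and `dedup_sentences` second mirrors the order in which the two stages are named above; documents that lose every sentence in the second stage are dropped entirely.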

Paper Structure

This paper contains 14 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Pairwise token overlap between source datasets (billions).
  • Figure 2: Number of documents remaining after MinHash deduplication and sentence-level deduplication.