Matina: A Large-Scale 73B Token Persian Text Corpus

Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri

TL;DR

The paper tackles the underrepresentation of Persian in NLP by introducing Matina, a large-scale Persian corpus of 72.9B tokens built from web pages, books and papers, and social media using a rigorous preprocessing and deduplication pipeline. It demonstrates the corpus’s value by continually pretraining a masked language model (XLM-RoBERTa Large) and by assessing domain-adaptive pretraining of an LLM (LLaMA 3.2 Instruct 8B) across Persian tasks and domains, yielding measurable improvements. By releasing both the dataset and the preprocessing code, the work provides a scalable resource for Persian NLP, including downstream tasks such as translation, sentiment analysis, and NER, and supports broader multilingual modeling efforts. The study also highlights practical considerations such as data quality, language-specific filtering, and domain relevance, underscoring Matina’s potential to accelerate open-source Persian LLM development.

Abstract

Text corpora are essential for training models for tasks such as summarization and translation, and for training large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and the preprocessing code are publicly available, enabling researchers to build on and improve this resource for future Persian NLP advancements.
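To make the "preprocessed and deduplicated" step concrete, here is a minimal sketch of exact-match deduplication. It is not the authors' pipeline, which this section does not describe. It assumes that documents are plain strings and that two documents count as duplicates when their Unicode-normalized, whitespace-collapsed text is identical. The function names `normalize`, `deduplicate`, and `retention_percent` and the sample documents are hypothetical.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms and collapse runs of whitespace."""
    # NFKC maps compatibility characters (e.g. Arabic presentation forms)
    # to their canonical equivalents so trivial variants compare equal.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def retention_percent(before: int, after: int) -> float:
    """Percentage of documents remaining after a filtering stage."""
    return 100.0 * after / before if before else 0.0


# Illustrative toy input (not data from the corpus): the second document
# differs from the first only in whitespace.
docs = ["سلام دنیا", "سلام   دنیا", "کتاب خوب"]
kept = deduplicate(docs)
print(len(kept))  # 2
print(f"{retention_percent(len(docs), len(kept)):.1f}%")  # 66.7%
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than their total length. Near-duplicate detection (for example MinHash) would need a different approach and is not shown here.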

Paper Structure

This paper contains 29 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The overall stages of the processing pipeline of the Matina Corpus.
  • Figure 2: Distribution of document length by source in the Matina Corpus. Length is measured as the log of the number of tokens produced by the Llama 3.1 tokenizer (Dubey et al., 2024).
  • Figure 3: Data reduction during preprocessing and deduplication varies significantly across sources. Social media shows the most drastic drop, with just 1.6% of documents remaining after deduplication, while other sources retain between 56.1% and 93.3%. The three bars for each source represent the percentage of documents left after each stage. Overall, about 14% of the initial documents remain.
  • Figure 4: Win rate of pretrained models over models without pretraining.
  • Figure 5: Document length distribution for web-based crawled data.
  • ...and 1 more figures