naab: A ready-to-use plug-and-play corpus for Farsi

Sadra Sabouri; Elnaz Rahmati; Soroush Gooran; Hossein Sameti

TL;DR

This work tackles the scarcity of large-scale Farsi text data by introducing naab, the largest publicly available cleaned Farsi corpus (~130GB, >250M paragraphs, ~15B words), hosted on Hugging Face, together with naab-raw and a streaming pre-processing toolkit. The authors assemble naab from multiple base corpora (Persian NLP, OSCAR-fa, AGP, Telegram, LSCP) and implement a pre-processing pipeline that handles text as a stream, so memory use stays $O(1)$ regardless of corpus size. They provide two dataset versions: naab-raw for users who want to apply their own cleaning, and naab for ready-to-use training, with selective download support to accommodate storage constraints. A baseline word-frequency analysis, reported with and without stopwords, shows how stopwords affect the list of most frequent words. The authors position naab as a resource for diverse Farsi NLP tasks, including LLM pre-training, NER, POS tagging, summarization, ASR, and TTS. Overall, naab aims to close the resource gap for Farsi NLP, enabling broader participation in open science and practical language technology development for low-resource languages.
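To make the constant-memory claim concrete, the following is a minimal sketch of a line-by-line streaming driver. It is not the authors' toolkit: the names `stream_clean`, `process`, and `demo_clean` are hypothetical, and the sketch assumes that each line can be cleaned independently of all others. Under that assumption, memory use is bounded by the longest single line rather than by the size of the corpus.

```python
import io
from typing import Callable, Iterable, Iterator, Optional, TextIO


def stream_clean(lines: Iterable[str],
                 clean_line: Callable[[str], Optional[str]]) -> Iterator[str]:
    """Yield cleaned lines one at a time; nothing is accumulated in memory."""
    for line in lines:
        cleaned = clean_line(line.rstrip("\n"))
        if cleaned is not None:  # None means "drop this line"
            yield cleaned


def process(src: TextIO, dst: TextIO,
            clean_line: Callable[[str], Optional[str]]) -> int:
    """Copy src to dst through clean_line and return the number of lines kept."""
    kept = 0
    for cleaned in stream_clean(src, clean_line):
        dst.write(cleaned + "\n")
        kept += 1
    return kept


def demo_clean(line: str) -> Optional[str]:
    # Hypothetical stand-in for a real cleaning step: trim, drop empty lines.
    line = line.strip()
    return line or None


if __name__ == "__main__":
    source = io.StringIO("  سلام دنیا  \n\n   \nمتن دوم\n")
    target = io.StringIO()
    print(process(source, target, demo_clean), "lines kept")
```

Because `process` only needs file-like objects, the same function would work on an open file handle for a multi-gigabyte shard; the in-memory `StringIO` objects above are only for demonstration.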

Abstract

The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low- and mid-resource languages, such as Farsi, still lags behind resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word NAAB (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource languages, to improve the performance of LLMs in their respective domains and bridge the gap between resource-rich and resource-poor languages.

Paper Structure (19 sections, 1 figure, 2 tables)

This paper contains 19 sections, 1 figure, 2 tables.

Introduction
Materials and Methods
Base Corpus
Pre-process
Filtering Non-Farsi Characters
Unifying Arabic/Farsi Characters
White Spaces
Removing Short Lines
Results
naab-raw
naab
Experiments
Usage & Future Works
Conclusions
Limitations
...and 4 more sections
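The four pre-processing steps named in the outline (filtering non-Farsi characters, unifying Arabic/Farsi characters, normalizing white space, and removing short lines) could be sketched per line as below. This is an illustration only, not the paper's implementation: the character whitelist, the two-entry Arabic-to-Farsi mapping, and the `MIN_WORDS` threshold are all assumptions chosen for the example, and the actual rules in the toolkit may differ.

```python
import re
from typing import Optional

# Assumed mapping from Arabic code points to their Farsi counterparts.
ARABIC_TO_FARSI = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Farsi keheh
})

# Assumed whitelist: the Arabic-script block, the zero-width non-joiner,
# ASCII digits, white space, and a few punctuation marks.
NON_FARSI = re.compile(r"[^\u0600-\u06FF\u200C0-9\s.,!?:;()\-]")
WHITESPACE = re.compile(r"\s+")
MIN_WORDS = 3  # hypothetical threshold for "short" lines


def clean_line(line: str) -> Optional[str]:
    """Apply the four steps in outline order; return None to drop the line."""
    line = NON_FARSI.sub(" ", line)            # 1. filter non-Farsi characters
    line = line.translate(ARABIC_TO_FARSI)     # 2. unify Arabic/Farsi characters
    line = WHITESPACE.sub(" ", line).strip()   # 3. collapse white space
    if len(line.split()) < MIN_WORDS:          # 4. remove short lines
        return None
    return line


if __name__ == "__main__":
    raw = "\u0643\u062A\u0627\u0628   \u0639\u0644\u064A hello  \u062E\u0648\u0628 \u0627\u0633\u062A"
    print(clean_line(raw))
```

Replacing disallowed characters with a space rather than deleting them keeps neighboring words from being fused together; the white-space step then collapses the resulting runs.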

Figures (1)

Figure 1: The top 20 most common words in naab, along with their corresponding frequencies, are presented in two categories: a) including all words, and b) excluding stopwords. The words are ranked by their counts, with the horizontal axis showing the frequency and the vertical axis listing the words.
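A ranking like the one in Figure 1 can be reproduced in outline with a simple counter over white-space-separated tokens. The sketch below is an assumption-laden illustration: `top_words` is a hypothetical helper, the five-word `STOPWORDS` set is a toy list rather than the one used in the paper, and splitting on white space is a simplification of real Farsi tokenization. Unlike the cleaning pipeline, the counter's memory grows with vocabulary size.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = frozenset({"و", "در", "به", "از", "که"})


def top_words(lines: Iterable[str], k: int = 20,
              drop_stopwords: bool = False) -> List[Tuple[str, int]]:
    """Return the k most frequent tokens, optionally skipping stopwords."""
    counts: Counter = Counter()
    for line in lines:
        for word in line.split():
            if drop_stopwords and word in STOPWORDS:
                continue
            counts[word] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    sample = ["کتاب و قلم", "کتاب در خانه و مدرسه"]
    print(top_words(sample, k=3))                       # panel (a): all words
    print(top_words(sample, k=3, drop_stopwords=True))  # panel (b): no stopwords
```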
