FuLG: 150B Romanian Corpus for Language Model Pretraining

Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu

TL;DR

FuLG addresses the underrepresentation of Romanian in open LLM training data by building a large, open Romanian corpus from CommonCrawl. The authors implement a reproducible pipeline covering data acquisition (CCNet), deduplication, and multi-layer quality filtering. The resulting corpus contains 156B tokens (589GB tokenized), or 220B tokens when counted with the Llama 3 tokenizer. An ablation study comparing FuLG with OSCAR and mC4 on a 1B-parameter decoder-only model shows competitive perplexities and suggests that FuLG may offer task-specific advantages; qualitative story-generation analyses complement these results. The work contributes a large, openly available Romanian dataset and a documented filtering methodology that can be adapted to other underrepresented languages, broadening access to open language model development.

Abstract

Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.

Paper Structure (9 sections, 2 tables)