FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text

Binbin Xu

TL;DR

The paper tackles the shortage of large-scale, temporally resolved multilingual character-frequency data. It introduces FineFreq, a dataset derived from web-scale corpora (FineWeb and FineWeb2) that covers over 1900 languages from 2013 to 2025 and contains more than 96 trillion characters. The resource provides per-language aggregate and yearly character frequencies with Unicode metadata in CSV and Parquet formats, enabling detailed diachronic analyses and downstream NLP or typographic tasks. By preserving cross-script usage and avoiding aggressive filtering, FineFreq reflects real-world multilingual writing and is publicly available to support reproducibility and broad research use.

Abstract

We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub (https://github.com/Bin-2/FineFreq) and HuggingFace.
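To make the idea of per-character statistics with Unicode metadata concrete, the following is a minimal sketch of counting characters in a string without filtering and attaching metadata to each entry. It is not the FineFreq pipeline: the function name `char_frequencies` and the row fields are assumptions chosen for illustration, and only the general category and character name are included, because Python's standard `unicodedata` module does not expose script or block.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Count every character in `text` and attach basic Unicode metadata.

    Hypothetical helper for illustration; field names are assumptions,
    not the FineFreq schema.
    """
    counts = Counter(text)
    total = sum(counts.values())
    rows = []
    # most_common() orders by descending count; ties keep first-seen order.
    for ch, n in counts.most_common():
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            "category": unicodedata.category(ch),  # e.g. "Ll", "Lo", "So"
            "name": unicodedata.name(ch, ""),      # empty if unnamed
            "count": n,
            "rel_freq": n / total if total else 0.0,
        })
    return rows


if __name__ == "__main__":
    # Mixed-script sample: Latin, emoji, and CJK are all kept.
    for row in char_frequencies("Café ☕ 咖啡 café")[:5]:
        print(row)
```

Because nothing is filtered at counting time, a downstream user can select by the `category` field afterwards, for example keeping only letters (categories starting with `L`).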

Paper Structure

This paper contains 4 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Left: Yearly disk sizes of FineWeb dumps (English only), v1.4.0. Right: Disk sizes of FineWeb2 multilingual data by top contributing languages, v2.1.0. In total 57.16 TB.
  • Figure 2: Top 25 characters by relative frequency for the top 10 non-CJK languages in the corpus.
  • Figure 3: Character cloud (top 200) for CJK languages.
  • Figure 4: Yearly relative frequency of top 9 characters in 7 major languages. Each subplot shows the temporal trend of one character's proportion relative to all characters in that language for a given year.
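
The quantity plotted in Figure 4, a character's share of all characters in one language for a given year, can be sketched as below. This is an illustrative computation under stated assumptions: the `(year, char, count)` record layout and the function name `yearly_relative_frequency` are hypothetical and do not reflect the released file schema.

```python
from collections import defaultdict


def yearly_relative_frequency(records):
    """Compute each character's share of all characters per year.

    `records` is an iterable of (year, char, count) tuples for a single
    language (an assumed layout, not the FineFreq schema).
    Returns {year: {char: count / total_count_in_that_year}}.
    """
    totals = defaultdict(int)
    per_year = defaultdict(dict)
    for year, ch, n in records:
        totals[year] += n
        per_year[year][ch] = per_year[year].get(ch, 0) + n
    return {
        year: {ch: n / totals[year] for ch, n in chars.items()}
        for year, chars in per_year.items()
        if totals[year] > 0  # skip years with no characters
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    demo = [(2013, "e", 30), (2013, "a", 70), (2014, "e", 50), (2014, "a", 50)]
    print(yearly_relative_frequency(demo))
```

Normalizing within each year rather than across the whole corpus keeps the trend comparable between years that contribute very different amounts of text.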