KréyoLID From Language Identification Towards Language Mining

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

TL;DR

The paper tackles the challenge of building corpora for low‑resource varieties by reframing language identification as a data‑mining problem. It introduces Language Mining, a two‑phase filtration that combines document‑level whitelists/blacklists with line‑level scoring to rapidly triage vast web crawls and extract high‑signal Creole content, particularly for French‑based Creoles. On Wikipedia benchmarks and large real‑world corpora (MADLAD‑400, GlotCC, Fineweb‑2), the approach shows strong recall, and at scale, on 21 TB of Common Crawl data, it delivers substantial speedups across first and second passes that refine the results. The work highlights practical benefits for rapid corpus construction and for potential downstream tasks (e.g., language model training), while acknowledging two limitations: data transfer bottlenecks, and language types for which whitespace tokenization is less effective.
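The two-phase idea summarized above can be illustrated with a minimal sketch. This is not the authors' pipeline: the word lists, the thresholds, and the function names (`passes_document_filter`, `line_score`, `mine`) are all hypothetical, chosen only to show how a cheap document-level check can discard most documents before any line-level scoring is done.

```python
# Illustrative sketch only: word lists and thresholds are assumptions,
# not values taken from the paper.
WHITELIST = {"mwen", "nou", "yo", "ka", "épi"}   # hypothetical Creole markers
BLACKLIST = {"the", "and", "der", "und"}          # hypothetical non-target markers


def tokens(text):
    """Whitespace tokenization (the summary notes this suits some languages poorly)."""
    return text.lower().split()


def passes_document_filter(doc, min_white=2, max_black_ratio=0.5):
    """Phase 1: cheap document-level triage using whitelist/blacklist counts."""
    toks = tokens(doc)
    if not toks:
        return False
    white = sum(t in WHITELIST for t in toks)
    black = sum(t in BLACKLIST for t in toks)
    return white >= min_white and black / len(toks) <= max_black_ratio


def line_score(line):
    """Phase 2: fraction of tokens in a line that are whitelist markers."""
    toks = tokens(line)
    if not toks:
        return 0.0
    return sum(t in WHITELIST for t in toks) / len(toks)


def mine(docs, line_threshold=0.2):
    """Keep only high-scoring lines from documents that survive phase 1."""
    kept = []
    for doc in docs:
        if not passes_document_filter(doc):
            continue  # most documents are expected to stop here
        for line in doc.splitlines():
            if line_score(line) >= line_threshold:
                kept.append(line)
    return kept


docs = [
    "the cat and the dog",
    "mwen ka palé épi yo\nthe end and the rest",
]
print(mine(docs))
```

In this toy run the first document is rejected at the document level, and only the first line of the second document survives line-level scoring. The design point is that the expensive per-line work is only spent on the small fraction of documents that pass the first phase.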

Abstract

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.

Paper Structure

This paper contains 28 sections, 1 figure, 5 tables, and 1 algorithm.

Figures (1)

  • Figure 1: Language Mining Pipeline