FastSpell: the LangId Magic Spell

Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Sergio Ortiz-Rojas

TL;DR

FastSpell targets a known weakness of language identifiers: telling closely related languages apart in multilingual corpora. It adds a second-opinion step on top of an existing identifier. fastText produces an initial prediction, and Hunspell-based spell checking against the targeted language and its similar languages then refines the decision; in conservative mode the result may be an 'unknown' label. A benchmark of several identifiers shows fastText to be fast and effective, while FastSpell improves accuracy on hard cases such as Montenegrin and Norwegian Nynorsk. FastSpell is openly available under GPLv3 and integrated into the Bitextor/Monotextor pipelines, supporting more reliable language resource creation from large-scale web-crawled data. The paper also describes the configurable resources and outlines future directions such as alternative models, improved tokenization, and multi-language dictionaries.

Abstract

Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between similar or closely related languages. This paper introduces FastSpell, a language identifier that combines fastText (a pre-trained language identifier tool) and Hunspell (a spell checker) to obtain a refined second opinion before deciding which language should be assigned to a text. We describe the FastSpell algorithm and explain how to use and configure it. To that end, we motivate the need for such a tool and present a benchmark of popular language identifiers evaluated during the development of FastSpell. We show that FastSpell is useful not only for improving the identification of similar languages, but also for identifying new ones ignored by other tools.
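
The second-opinion idea described above can be illustrated with a minimal sketch. This is not the FastSpell implementation: the toy word sets stand in for Hunspell dictionaries, the first guess is passed in rather than computed by a fastText model, and the names `TOY_DICTIONARIES`, `SIMILAR`, `error_rate`, `second_opinion` and the `max_error` threshold are all hypothetical choices made for illustration. The tie-breaking and threshold rules are assumptions as well.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for Hunspell dictionaries (a real setup would load
# one spell-checking dictionary per language).
TOY_DICTIONARIES: Dict[str, Set[str]] = {
    "nb": {"jeg", "heter", "ikke", "en", "bok"},
    "nn": {"eg", "heiter", "ikkje", "ei", "bok"},
    "da": {"jeg", "hedder", "ikke", "en", "bog"},
}

# Hypothetical similar-language table: target language -> languages it is
# easily confused with.
SIMILAR: Dict[str, List[str]] = {"nn": ["nb", "da"]}


def error_rate(text: str, lang: str) -> float:
    """Fraction of whitespace tokens not found in the language's word list."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    misses = sum(1 for token in tokens if token not in TOY_DICTIONARIES[lang])
    return misses / len(tokens)


def second_opinion(
    text: str,
    target: str,
    first_guess: str,
    conservative: bool = False,
    max_error: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Refine a first-pass prediction by spell checking in similar languages."""
    candidates = [target] + SIMILAR.get(target, [])
    # Only intervene when the first guess falls inside the confusable group.
    if first_guess not in candidates:
        return first_guess

    rates = {lang: error_rate(text, lang) for lang in candidates}
    best_rate = min(rates.values())
    best = [lang for lang in candidates if rates[lang] == best_rate]

    # No dictionary explains the text well enough: keep or reject the guess.
    if best_rate > max_error:
        return "unknown" if conservative else first_guess
    if len(best) == 1:
        return best[0]
    # Several languages tie: conservative mode refuses to decide.
    if conservative:
        return "unknown"
    return first_guess if first_guess in best else best[0]


if __name__ == "__main__":
    print(second_opinion("eg heiter ikkje", target="nn", first_guess="nb"))
    print(second_opinion("bok", target="nn", first_guess="nb", conservative=True))
```

In this toy setup the first call overrides the initial `nb` guess with `nn`, because only the Nynorsk word list covers all three tokens, while the second call returns `unknown`, because `bok` is valid in both `nn` and `nb` and conservative mode declines to pick one.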

Paper Structure (13 sections, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: fastText confusion matrices for some groups of similar languages.
  • Figure 2: First lines of the default similar.yaml file
  • Figure 3: The FastSpell Algorithm