GlotScript: A Resource and Tool for Low Resource Writing System Identification

Amir Hossein Kargaran, François Yvon, Hinrich Schütze

TL;DR

GlotScript tackles the challenge of low-resource script identification by introducing GlotScript-R, a wide-coverage resource of writing-system metadata for over 7,000 language varieties, and GlotScript-T, a fast tool that identifies script distributions for all 161 Unicode 15.0 scripts using ISO 15924 codes. By merging multiple metadata sources and defining CORE and AUXILIARY categories, GlotScript provides robust, nuanced script labeling that enhances corpus cleaning and enables thorough analysis of multilingual model tokenization and UDHR translations. Empirical results show high corpus-quality scores in OSCAR and mC4, diverse script coverage across ten state-of-the-art multilingual models, and clear evidence of script-aware tokenization costs, underscoring the practical impact for low-resource language NLP and LLM deployment. The work offers open resources for the NLP community and argues for per-sentence script metadata to prevent errors and improve data quality across language resources and models.

Abstract

We present GlotScript, an open resource and tool for low-resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns the text's script distribution, with scripts identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we use GlotScript to analyze the tokenization of a number of language models, such as GPT-4, and provide insights into each model's coverage of low-resource scripts and languages. We hope that GlotScript will become a useful resource for work on low-resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.
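
To make the idea of a "script distribution" concrete, here is a minimal sketch in Python. It is not GlotScript-T's implementation or API: the function names and the small name-prefix table are assumptions made for illustration, and the sketch guesses a character's script from its Unicode character name instead of using the full Unicode Scripts property that a complete tool would rely on. It returns the main script, the fraction of letters in that script, and the per-script distribution.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately small table mapping the first word of a Unicode
# character name to an ISO 15924 code. It covers only a handful of scripts.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
    "DEVANAGARI": "Deva",
    "CJK": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "HANGUL": "Hang",
}


def char_script(ch):
    """Return an ISO 15924 code for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation and whitespace are not counted
    name = unicodedata.name(ch, "")
    prefix = name.split(" ")[0] if name else ""
    # "Zzzz" is used here as a fallback for letters this sketch cannot map.
    return NAME_PREFIX_TO_ISO15924.get(prefix, "Zzzz")


def script_distribution(text):
    """Return (main_script, main_fraction, {script: fraction}) for text."""
    counts = Counter(s for s in map(char_script, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, {}
    dist = {script: n / total for script, n in counts.items()}
    main = max(dist, key=dist.get)
    return main, dist[main], dist


if __name__ == "__main__":
    print(script_distribution("Hello мир"))
```

For the mixed Latin and Cyrillic input above, five of the eight letters are Latin, so the sketch reports Latn as the main script with a fraction of 0.625.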

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: How to use GlotScript-T: three examples. GlotScript-T returns a tuple consisting of the main script, the percentage of characters in the main script and detailed information on the distribution of scripts.
  • Figure 2: The percentage of each script in the vocabulary of model tokenizers. Scripts with a presence of more than 1% in each tokenizer are text-labeled in the figure.
  • Figure 3: Analysis of the multilinguality of the tokenization of ten language models, performed on 396 UDHR translations. Left: the number of tokens into which each UDHR translation is tokenized. Tokenizer-translation pairs with more than 5% unknown tokens are omitted. Right: the percentage of unknown tokens generated for each tokenizer-translation pair.
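
The two quantities in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function names, the `<unk>` marker, and the dictionary layout are assumptions. Only the 5% cutoff comes from the caption.

```python
def unknown_token_percent(tokens, unk_token="<unk>"):
    """Percentage of tokens equal to the tokenizer's unknown-token marker."""
    if not tokens:
        return 0.0
    return 100.0 * sum(t == unk_token for t in tokens) / len(tokens)


def token_counts(tokenized, unk_token="<unk>", max_unk_percent=5.0):
    """Token count per (tokenizer, translation) pair, omitting noisy pairs.

    `tokenized` maps a (tokenizer_name, translation_id) pair to the list of
    token strings produced for that translation (a hypothetical layout).
    Pairs with more than `max_unk_percent` unknown tokens are left out.
    """
    kept = {}
    for pair, tokens in tokenized.items():
        if unknown_token_percent(tokens, unk_token) > max_unk_percent:
            continue
        kept[pair] = len(tokens)
    return kept


if __name__ == "__main__":
    example = {
        ("tok_a", "eng"): ["All", "human", "beings", "are", "born", "free"],
        ("tok_a", "xyz"): ["<unk>", "<unk>", "a", "b"],
    }
    print(token_counts(example))
```

Filtering before counting matters because a tokenizer that maps most of a text to a single unknown token can produce a deceptively short token sequence.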