Geographically-Informed Language Identification

Jonathan Dunn, Lane Edwards-Brown

TL;DR

This work tackles language identification for broad multilingual coverage by leveraging geographic priors. It constructs 16 region-specific language identification (LID) models, each including local languages plus 31 international linguae francae, using a fastText-based architecture. Upstream evaluations, validated against OpenLID data, show f-score gains of 1.7 to 10.4 points across regions. Downstream tests on 189 million geo-tagged tweets show that roughly 13% of language labels change in large corpora, especially for low-resource languages. The approach covers 916 languages at a sample size of 50 characters and enables higher-quality multilingual corpora, with the GeoLID toolkit released for broad use and future audits.

Abstract

This paper develops an approach to language identification in which the set of languages considered by the model depends on the geographic origin of the text in question. Given that many digital corpora can be geo-referenced at the country level, this paper formulates 16 region-specific models, each of which contains the languages expected to appear in countries within that region. Each regional model also includes 31 widely spoken international languages in order to ensure coverage of these linguae francae regardless of location. An upstream evaluation using traditional language identification testing data shows an improvement in f-score ranging from 1.7 points (Southeast Asia) to as much as 10.4 points (North Africa). A downstream evaluation on social media data shows that this improved performance has a significant impact on the language labels which are applied to large real-world corpora. The result is a highly accurate model that covers 916 languages at a sample size of 50 characters, with performance improved by incorporating geographic information into the model.
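The core idea, that the candidate language set depends on the country of origin, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the region names, the country-to-region mapping, the truncated language inventories, and the `identify` helper are hypothetical and are not taken from the GeoLID toolkit. The sketch also masks the scores of a single hypothetical scorer rather than training separate regional models as the paper does, so it shows the effect of restricting the label set, not the authors' implementation.

```python
from typing import Dict, Set

# Hypothetical, heavily truncated inventories (ISO 639-3 codes).
# The paper describes 16 regions and 31 international languages.
INTERNATIONAL: Set[str] = {"eng", "fra", "spa", "arb"}

REGION_LOCAL: Dict[str, Set[str]] = {
    "north_africa": {"kab", "tzm", "ary"},
    "southeast_asia": {"ind", "tgl", "vie", "ceb"},
}

# Hypothetical country-to-region assignment (ISO 3166-1 alpha-2 codes).
COUNTRY_TO_REGION: Dict[str, str] = {
    "DZ": "north_africa",
    "MA": "north_africa",
    "PH": "southeast_asia",
    "VN": "southeast_asia",
}


def candidate_languages(country: str) -> Set[str]:
    """Return local languages for the country's region plus the international set.

    Unknown countries fall back to the international languages only.
    """
    region = COUNTRY_TO_REGION.get(country)
    local = REGION_LOCAL.get(region, set())
    return local | INTERNATIONAL


def identify(scores: Dict[str, float], country: str) -> str:
    """Pick the highest-scoring language among those allowed for the country.

    `scores` stands in for the output of some language identification model.
    """
    allowed = candidate_languages(country)
    restricted = {lang: s for lang, s in scores.items() if lang in allowed}
    if not restricted:
        raise ValueError("no candidate language received a score")
    return max(restricted, key=restricted.get)


# The same scores yield different labels depending on geographic origin.
scores = {"ceb": 0.5, "kab": 0.4, "eng": 0.1}
label_dz = identify(scores, "DZ")
label_ph = identify(scores, "PH")
```

With these toy scores, a sample from Algeria cannot be labeled Cebuano because that language is outside the North Africa candidate set, while the same sample from the Philippines can.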

Paper Structure (9 sections, 1 figure, 9 tables)

Figures (1)

  • Figure 1: Map showing agreement between language identification models by country. A value of 0.80 means that 80% of samples receive the same language label from each model. Agreement is calculated using approximately 1 million random tweets per country, where each tweet has at least 50 characters.
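
The agreement value in the Figure 1 caption can be read as the share of samples on which the models assign the same label. A minimal sketch follows, assuming two models are being compared and that their labels are available as parallel lists; the function name `agreement` and the toy labels are hypothetical.

```python
from typing import Sequence


def agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of samples that receive the same label from both models."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("at least one sample is required")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Toy labels for five samples from one country; the models disagree on one.
model_1 = ["eng", "fra", "kab", "arb", "eng"]
model_2 = ["eng", "fra", "tzm", "arb", "eng"]
value = agreement(model_1, model_2)  # 4 of 5 labels match
```

Here four of the five labels match, which corresponds to the 0.80 example in the caption.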