How Should We Model the Probability of a Language?

Rasul Dent; Pedro Ortiz Suarez; Thibault Clérice; Benoît Sagot

TL;DR

This paper argues that broad language identification (LID) coverage is hampered by a decontextualized framing that binds LID to a single global prior over a fixed label set. It reframes LID as a routing problem that can incorporate local context signals as priors, so that locally plausible languages can be recognized and routed appropriately without retraining. By formalizing the Bayesian view $P(\ell \mid X) \propto P(X \mid \ell)\,P(\ell)$, where $X$ is the input text and $\ell$ a candidate language, and illustrating the dangers of global priors (e.g., rare languages being overwhelmed, false positives at scale), the authors propose context-aware priors and gating mechanisms as practical paths forward, supported by case studies on Louisiana Creole and Lingua Franca. They also discuss incentives and structural barriers, offering two forward-looking strategies, along with evaluation and transparency practices to ensure usable, ethically aware deployment for tail-language communities.
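To make the role of the prior concrete, the following is a minimal Python sketch of Bayes' rule applied to two candidate labels. It is not the authors' implementation: the function name `posterior`, the log-likelihood values, and both prior tables are hypothetical numbers chosen for illustration, and the labels `fra` (French) and `lou` (Louisiana Creole) are used only as example keys. The sketch shows how the same text-level evidence can yield different decisions under a global prior and under a local, context-informed prior, with no change to the likelihood model.

```python
import math


def posterior(log_likelihoods, priors):
    """Return P(lang | X) from log P(X | lang) and P(lang) via Bayes' rule.

    Languages with zero (or missing) prior mass are excluded, which mimics
    a fixed label set that simply cannot emit them.
    """
    # Unnormalized log posterior: log P(X | lang) + log P(lang)
    scores = {
        lang: ll + math.log(priors[lang])
        for lang, ll in log_likelihoods.items()
        if priors.get(lang, 0.0) > 0.0
    }
    # Normalize with the log-sum-exp trick for numerical stability
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {lang: math.exp(s - top) / norm for lang, s in scores.items()}


# Hypothetical evidence: the text itself slightly favors Louisiana Creole.
log_lik = {"fra": -42.0, "lou": -40.0}

# Hypothetical global prior: the rare language is heavily down-weighted.
global_prior = {"fra": 0.999, "lou": 0.001}

# Hypothetical local prior: context makes both languages equally plausible.
local_prior = {"fra": 0.5, "lou": 0.5}

global_post = posterior(log_lik, global_prior)
local_post = posterior(log_lik, local_prior)

print("global prior ->", max(global_post, key=global_post.get))
print("local prior  ->", max(local_post, key=local_post.get))
```

Under these assumed numbers, the global prior outweighs the likelihood advantage of the rarer language, while the local prior lets the text-level evidence decide. Swapping the prior table is the only change between the two calls, which is the sense in which context can be incorporated without retraining.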

Abstract

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.

Paper Structure (25 sections, 8 equations, 1 figure)

Introduction
The Received Framing of LID
Standard Approaches
Alternative Framings
Probability Problems
Global Frequency?
Attenuated Frequency?
False Positives at Scale
Local Priors?
Where Do Languages Live?
Dataset Difficulties
Model Expectations
Towards Context-Aware LID
Case Studies
Language Revitalization
...and 10 more sections

Figures (1)

Figure 1: Interfaces allow override, but not for Louisiana Creole (LC). (Google Translate)
