
How Should We Model the Probability of a Language?

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

TL;DR

This paper argues that broad LID coverage is hampered by a decontextualized framing that binds language identification to a single global prior over a fixed label set. It reframes LID as a routing problem that can incorporate local context signals as priors, enabling locally plausible languages to be recognized and routed appropriately without retraining. By formalizing the Bayesian view $P(\ell \mid X) \propto P(X \mid \ell)\,P(\ell)$ and illustrating the dangers of global priors (e.g., rare languages being overwhelmed, false positives at scale), the authors propose context-aware priors and gating mechanisms as practical paths forward, supported by case studies on Louisiana Creole and Lingua Franca. They also discuss incentives and structural barriers, offering two forward-looking strategies, along with evaluation and transparency practices to ensure usable, ethically aware deployment for tail-language communities.

Abstract

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
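
To make the role of the prior concrete, the following minimal sketch applies Bayes' rule, $P(\ell \mid X) \propto P(X \mid \ell)\,P(\ell)$, to a single text under two different priors. It is an illustration under stated assumptions, not the authors' method: the function name `posterior`, the two-language label set, and all numeric values are hypothetical toy choices picked only to show the effect.

```python
def posterior(likelihood, prior):
    """Normalize P(X | lang) * P(lang) over the candidate languages."""
    unnormalized = {
        lang: likelihood[lang] * prior.get(lang, 0.0) for lang in likelihood
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no candidate has both nonzero prior and likelihood")
    return {lang: p / total for lang, p in unnormalized.items()}


# Toy likelihoods (assumed): the text is three times more probable
# under the Louisiana Creole model than under the French model.
likelihood = {"French": 0.02, "Louisiana Creole": 0.06}

# Assumed global prior: the tail language is extremely rare overall.
global_prior = {"French": 0.999, "Louisiana Creole": 0.001}

# Assumed local prior: context (for example, the source of the text)
# makes the tail language locally plausible.
local_prior = {"French": 0.6, "Louisiana Creole": 0.4}

global_post = posterior(likelihood, global_prior)
local_post = posterior(likelihood, local_prior)

best_global = max(global_post, key=global_post.get)
best_local = max(local_post, key=local_post.get)

print("global prior:", best_global, global_post)
print("local prior: ", best_local, local_post)
```

With these toy numbers, the same likelihoods yield French under the global prior and Louisiana Creole under the local prior. Only the prior changes between the two calls, which is the sense in which context can be incorporated without retraining the likelihood model.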
