Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations

Jon Kleinberg, Fan Wei

TL;DR

This work analyzes language generation in the limit under adversaries that either enumerate the full language or only an infinite subset. It establishes a tight lower-density bound of $\tfrac{1}{2}$ in the full-enumeration model and an $\alpha/2$ bound in the partial-enumeration setting, where the revealed subset has lower density $\alpha$ in $K$. It introduces conjunction-based representations (finite intersections) and proves the existence of a generation-in-the-limit algorithm that is accurate infinitely often, even with partial enumeration. A central contribution is a topological reformulation of Angluin’s identification theorem, which connects identifiability to the $T_D$ separation property in spaces $\tau_{C}$ and reveals robustness to finite deletions in the partial-enumeration regime. The results unify generation and identification through a shared topological lens, provide constructive algorithms (e.g., Algorithm 1 with pods), and show how partial information affects learnability and breadth in language learning tasks.

Abstract

The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line of work, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \subseteq M_t \subseteq K$ -- and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
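
To make the density language in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the usual notion of lower density of a set $S \subseteq K$, namely the $\liminf$ over $n$ of the fraction of the first $n$ strings of $K$ (in the underlying ordering) that lie in $S$; the paper's exact definition may differ in details. Because a $\liminf$ cannot be computed from finite data, the hypothetical helper `tail_min_density` uses the minimum prefix density after a burn-in as a finite stand-in. The toy sets `K`, `C`, and `O`, and both function names, are illustrative assumptions.

```python
from fractions import Fraction


def prefix_densities(subset, ordered_language):
    """Return |subset ∩ {k_1..k_n}| / n for each prefix length n."""
    members = set(subset)
    hits = 0
    densities = []
    for n, s in enumerate(ordered_language, start=1):
        if s in members:
            hits += 1
        densities.append(Fraction(hits, n))
    return densities


def tail_min_density(subset, ordered_language, burn_in):
    """Finite proxy for lower density: the smallest prefix density
    among prefixes longer than `burn_in` (not a true liminf)."""
    return min(prefix_densities(subset, ordered_language)[burn_in:])


# Toy instance (hypothetical): K is a finite prefix of an ordered language.
K = list(range(1, 1001))
C = [k for k in K if k % 2 == 0]  # revealed subset, density close to 1/2 in K
O = [k for k in K if k % 4 == 0]  # candidate output set, density close to 1/4

alpha_hat = tail_min_density(C, K, burn_in=100)
out_hat = tail_min_density(O, K, burn_in=100)
print(float(alpha_hat), float(out_hat))
```

In this toy instance the revealed set has density near $\alpha = 1/2$ and the output set has density near $\alpha/2 = 1/4$, which is the shape of the guarantee stated in the abstract; the finite estimates only approximate those limiting values.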

Paper Structure

This paper contains 19 sections, 28 theorems, 61 equations.

Key Result

Theorem 1.3

There is an algorithm ${\mathcal{A}}$ that achieves generation in the limit and has the following property. Given any countable collection of languages ${\mathcal{X}}$ with an underlying ordering of the strings in all the languages, and for any adversarial enumeration $E$ of one of the languages ${K

Theorems & Definitions (60)

  • Definition 1.1
  • Theorem 1.3
  • Theorem 1.5
  • Theorem 1.6
  • Theorem 1.7
  • Theorem 1.8
  • Theorem 1.9: Topological restatement of the Angluin Theorem
  • Theorem 1.10
  • Theorem 1.11
  • Theorem 2.1: Restatement of Theorem \ref{thm:introKWnew}
  • ...and 50 more