Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations

Jon Kleinberg; Fan Wei

TL;DR

This work analyzes language generation in the limit under adversaries that either enumerate the full language or only an infinite subset, establishing a tight lower-density bound of $\tfrac{1}{2}$ in the full-enumeration model and an $\alpha/2$ bound in the partial-enumeration setting when the revealed subset has lower density $\alpha$ in $K$. It introduces conjunction-based representations (finite intersections) and proves the existence of a generation-in-the-limit algorithm that is accurate infinitely often, even with partial enumeration. A central contribution is a topological reformulation of Angluin’s identification theorem, connecting identifiability to $T_D$ separation properties in spaces $\tau_{C}$ and revealing robustness to finite deletions in the partial-enumeration regime. The results unify generation and identification through a shared topological lens, provide constructive algorithms (e.g., Algorithm 1 with pods), and illuminate how partial information affects learnability and breadth in language-learning tasks.

Abstract

The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible -- when hypotheses $M_t$ eventually satisfy $C \subseteq M_t \subseteq K$ -- and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
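To make the density statement concrete, here is a minimal sketch of what "lower density $\alpha$ in $K$" and the $\alpha/2$ target look like on a toy example. It assumes the usual definition of lower density (the liminf, over $n$, of the fraction of the first $n$ strings of $K$ in a fixed ordering that lie in the set) and approximates the liminf by a minimum over a finite tail. The toy language, the sets `C` and `O`, and the helper names are hypothetical and not taken from the paper; in particular, `O` is hand-picked to sit at half the density of `C` and is not produced by the paper's algorithm.

```python
from fractions import Fraction
from typing import List, Set


def prefix_densities(subset: Set[int], enumeration: List[int]) -> List[Fraction]:
    """For n = 1..N, the fraction of the first n enumerated strings lying in `subset`."""
    hits = 0
    densities = []
    for n, s in enumerate(enumeration, start=1):
        if s in subset:
            hits += 1
        densities.append(Fraction(hits, n))
    return densities


def tail_min(densities: List[Fraction], start: int) -> Fraction:
    """Finite proxy for the liminf: minimum over all prefixes of length >= start."""
    return min(densities[start - 1:])


# Toy setup (assumption): strings of K are identified with the integers 1..6000,
# enumerated in increasing order.
K = list(range(1, 6001))
C = {k for k in K if k % 2 == 0}  # revealed subset, density about 1/2 in K
O = {k for k in K if k % 4 == 0}  # every other element of C, density about 1/4 in K

alpha_hat = tail_min(prefix_densities(C, K), start=1000)
out_hat = tail_min(prefix_densities(O, K), start=1000)

print(f"estimated lower density of C in K: {float(alpha_hat):.4f}")
print(f"estimated lower density of O in K: {float(out_hat):.4f}")
print(f"alpha / 2:                         {float(alpha_hat) / 2:.4f}")
```

Exact `Fraction` arithmetic keeps the prefix ratios free of rounding error; the tail cutoff `start=1000` is an arbitrary choice that only controls how closely the finite minimum tracks the true liminf.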

