Language Generation in the Limit

Jon Kleinberg, Sendhil Mullainathan

TL;DR

The paper formalizes language generation in the adversarial Gold–Angluin setting and shows that generation in the limit is always achievable for any countable collection ${\mathcal C}$ and any enumeration of a true language $K \in {\mathcal C}$. Identification in the limit, by contrast, is impossible in general. The paper introduces closure and critical languages, first to construct a nonconstructive function $f_{\mathcal C}$ and then, more practically, to give an explicit algorithm that maintains finite prefixes of the languages and uses $(t,m)$-critical languages to guarantee that its output satisfies $a_t \in K \setminus S_t$ for all sufficiently large $t$, where $S_t$ is the set of strings presented through step $t$. For finite ${\mathcal C}$, there is a uniform bound $t({\mathcal C})$ such that after observing $t({\mathcal C})$ distinct samples, the algorithm can generate an infinite sequence of unseen elements of $K$, which strengthens the result beyond mere existence. The work also extends to prompting, showing that robust prompts allow prompted generation in the limit, and discusses regular-subset queries as a way to broaden the computational toolkit. Together, these results highlight a fundamental separation between generation and identification, with implications for the theory and practice of language modeling and prompting under adversarial conditions.

Abstract

Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without further assumptions. In particular, suppose that an adversary enumerates the strings of an unknown target language L that is known only to come from one of a possibly infinite list of candidates. A computational agent is trying to learn to generate from this language; we say that the agent generates from L in the limit if after some finite point in the enumeration of L, the agent is able to produce new elements that come exclusively from L and that have not yet been presented by the adversary. Our main result is that there is an agent that is able to generate in the limit for every countable list of candidate languages. This contrasts dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it.
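
The condition for generating in the limit can be stated symbolically. The following is a sketch in the TL;DR's notation rather than a quotation from the paper: $K$ is the target language (written L in the abstract), and we assume $S_t$ denotes the set of strings the adversary has presented through step $t$ and $a_t$ the agent's output at step $t$.

$$\exists\, t^* \ \text{such that}\ \forall\, t \ge t^*:\quad a_t \in K \setminus S_t$$

Read this way, the main result says that a single agent meets this condition for every $K \in {\mathcal C}$ and every enumeration of $K$, whenever ${\mathcal C}$ is countable. The threshold $t^*$ may depend on both $K$ and the enumeration.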

Paper Structure (36 sections, 2 figures)

Figures (2)

  • Figure 1: This and the next figure show an example of the first five steps of the algorithm from Section \ref{sec:gen-alg} on a sample input. (It is useful to read the description of the algorithm before consulting this figure.) The strings produced by the adversary over the first five steps are, in order, $u_2, u_5, u_8, u_{10}, u_{12}$. In the figure, the steps of the algorithm are shown separately, each language considered in a given step is shown as a vertical column, and there is a row for each string considered at some point in the step. The string $u_j$ belongs to $L_i$ if and only if there is an "X" in the column for $L_i$ and the row for $u_j$. In step $t$, the rows corresponding to strings the adversary has already produced are shaded (so, for example, in Step 2 the row for $u_8$ is not shaded because the adversary hasn't yet produced $u_8$; it only does so in Step 3). The algorithm considers the languages in ${\cal C}_t = \{L_1, L_2, \ldots, L_t\}$ in step $t$, and it only considers a finite prefix of each language in step $t$, from an index $m_t^{(0)}$ at the start of the step to an index $m_t$ at the end. The "heights" of the columns in each step go up to the final index $m_t$. Recall that the algorithm starts with $m = m_t^{(0)}$, finds the highest-indexed $(t,m)$-critical language $L_{n_t(m)}$ among ${\cal C}_t$, and begins searching through strings of increasing index to find a new string in $L_{n_t(m)}$. During this search, as $m$ increases, the identity of the highest-indexed $(t,m)$-critical language might change. In Step $t = 1$, there is no consistent language in ${\cal C}_t$, so the algorithm can generate an arbitrary string. Step $t = 2$ starts with $m = m_t^{(0)} = 5$: $L_2$ is $(t,m)$-critical for all $m$, and so the algorithm tests $u_m$ for membership in $L_2$ beginning at $m = 6$ until it finds the first new string in $L_2$, which happens with $m = 7$; $u_7$ is therefore generated. Step $t = 3$ starts with $m = m_t^{(0)} = 8$: $L_3$ is the highest-indexed $(t,8)$-critical language in ${\cal C}_t$ (since $L_3[8] \subseteq L_2[8]$), and so the algorithm searches for the next $u_m \in L_3$ for as long as $L_3$ remains $(t,m)$-critical; it finds such a string at $m = 10$, and so $u_{10}$ is generated.
  • Figure 2: This is a continuation of the execution of the algorithm on the sample input from Figure 1. Step $t = 4$ starts with $m = m_t^{(0)} = 10$: $L_4$ is consistent but not $(t,10)$-critical (since $L_4[10] \not\subseteq L_3[10]$), and so $L_3$ remains the highest-indexed $(t,10)$-critical language. As before, the algorithm searches for the next $u_m \in L_3$ for as long as $L_3$ remains $(t,m)$-critical; it finds such a string at $m = 12$, and so $u_{12}$ is generated. Step $t = 5$ starts with $m = m_t^{(0)} = 12$: $L_5$ is the highest-indexed $(t,12)$-critical language (since $L_5[12]$ is a subset of both $L_3[12]$ and $L_2[12]$), and so the algorithm begins searching for the next $u_m \in L_5$ for as long as $L_5$ remains $(t,m)$-critical. Once $m = 14$, however, the algorithm finds that $L_5$ is not $(t,14)$-critical (since $L_5[14] \not\subseteq L_3[14]$), and so $L_3$ is the highest-indexed $(t,14)$-critical language. The algorithm switches to searching for the next $u_m \in L_3$ for as long as $L_3$ remains $(t,m)$-critical; it finds such a string at $m = 15$, and so $u_{15}$ is generated.
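
The step described in the two captions can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: strings $u_j$ are represented by their indices $j$, each language is a membership predicate, and the helper names `prefix`, `highest_critical`, and `generate_step`, the `max_m` search cap, and the toy languages are all hypothetical. The test for $(t,m)$-criticality is inferred from the captions: a consistent language is treated as critical when its prefix is contained in the prefix of every earlier consistent language.

```python
from typing import Callable, List, Optional, Set, Tuple

# Assumption: a language is modelled as a membership test on string indices,
# so lang(j) is True exactly when u_j belongs to the language.
Language = Callable[[int], bool]


def prefix(lang: Language, m: int) -> Set[int]:
    """Indices j <= m with u_j in the language, i.e. the finite prefix L[m]."""
    return {j for j in range(1, m + 1) if lang(j)}


def highest_critical(langs: List[Language], seen: Set[int], m: int) -> Optional[int]:
    """0-based position of the highest-indexed (t,m)-critical language, or None
    when no language in the list is consistent with the strings seen so far."""
    earlier: List[Set[int]] = []  # prefixes of the consistent languages visited so far
    best = None
    for i, lang in enumerate(langs):
        if not all(lang(j) for j in seen):
            continue  # inconsistent with the adversary's strings
        p = prefix(lang, m)
        if all(p <= q for q in earlier):  # contained in every earlier consistent prefix
            best = i
        earlier.append(p)
    return best


def generate_step(langs: List[Language], seen: Set[int], m_start: int,
                  max_m: int = 10_000) -> Optional[Tuple[int, int]]:
    """One step t: langs plays the role of C_t, seen of S_t, m_start of m_t^(0).

    Returns (m, n), where u_m is the generated string and n is the 0-based
    position of the language it was drawn from, or None when no language is
    consistent (the caller may then output an arbitrary string).
    """
    if highest_critical(langs, seen, m_start) is None:
        return None
    for m in range(m_start + 1, max_m + 1):
        # The critical language is recomputed for each m, since it can change.
        n = highest_critical(langs, seen, m)
        if m not in seen and langs[n](m):
            return m, n
    raise RuntimeError("no new string found below max_m (hypothetical cap)")


# Toy languages (hypothetical, not the sample input from the figures).
everything = lambda j: True
evens = lambda j: j % 2 == 0
fours = lambda j: j % 4 == 0
fours_then_odds = lambda j: j % 4 == 0 or (j % 2 == 1 and j >= 15)

print(generate_step([everything, evens, fours], {4, 8, 12}, 12))            # (16, 2)
print(generate_step([everything, evens, fours_then_odds], {4, 8, 12}, 12))  # (16, 1)
```

In the second call the third language stops being critical once $m = 15$, because its prefix then contains an odd index that the prefix of `evens` lacks. The search therefore falls back to `evens`, mirroring the switch from $L_5$ to $L_3$ in Step 5 of Figure 2. The cap `max_m` exists only to keep the sketch from looping forever on ill-formed input.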