A Characterization of List Language Identification in the Limit
Moses Charikar, Chirag Pabbaraju, Ambuj Tewari
TL;DR
This work studies language identification in the limit when the learner may output a list of $k$ guesses at each step. It introduces the recursive $k$-Angluin condition, a tell-tale-based criterion that exactly characterizes $k$-list identification in the limit, and shows that any collection satisfying it decomposes into a union of $k$ subcollections, each identifiable with a list of size $1$. The authors give a constructive algorithm for the upper bound and a diagonalization argument for the lower bound, prove a stratification result, and show that in the statistical setting $k$-list identifiable collections can be identified at an exponential rate, which is optimal; collections that fail the condition cannot be identified at any rate that goes to zero. Together, these results clarify the power gained by allowing small lists in identification tasks and connect worst-case identifiability to probabilistic rates under i.i.d. inputs. These contributions advance the understanding of learnability under limited feedback and have implications for the design of robust language-identification and related inference systems.
Abstract
We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
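The decomposition statement can be made concrete with a toy example that is not taken from the paper. Consider the collection of all finite subsets of the natural numbers together with the set of all natural numbers; by Gold's classical result, this collection cannot be identified in the limit with a single guess. It does split into two subcollections that are each identifiable on their own (the finite sets, and the single language of all naturals), so running one learner per subcollection gives a list of size $2$. The minimal Python sketch below illustrates only this easy direction (decomposition implies list identifiability) and is not the authors' algorithm. The names `two_list_learner`, `is_correct`, and `ALL_NATURALS` are hypothetical.

```python
from typing import FrozenSet, Iterable, Iterator, Set, Tuple, Union

# Hypothetical sentinel standing for the language of all natural numbers.
ALL_NATURALS = "N"
Guess = Union[FrozenSet[int], str]


def two_list_learner(stream: Iterable[int]) -> Iterator[Tuple[Guess, Guess]]:
    """Yield a list of two guesses after each example in the stream."""
    seen: Set[int] = set()
    for x in stream:
        seen.add(x)
        # Learner for the finite languages: guess exactly the examples seen so far.
        # Learner for the subcollection {N}: always guess N.
        yield (frozenset(seen), ALL_NATURALS)


def is_correct(guesses: Tuple[Guess, Guess], target: Guess) -> bool:
    """A list of guesses is correct if the target language appears in it."""
    return target in guesses


# Finite target {2, 5, 7}: the first guess is correct once every element has appeared.
finite_target = frozenset({2, 5, 7})
finite_flags = [is_correct(g, finite_target)
                for g in two_list_learner([5, 2, 5, 7, 2, 7])]

# Target N: the second guess is correct from the first step onward.
infinite_flags = [is_correct(g, ALL_NATURALS)
                  for g in two_list_learner(range(10))]
```

In this sketch, whichever language is the target, one of the two guesses is correct from some finite time onward, which is the success criterion for a list of size $2$ described above.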
