Language Generation: Complexity Barriers and Implications for Learning

Marcelo Arenas, Pablo Barceló, Luis Cofré, Alexander Kozachinskiy

TL;DR

This work investigates language generation in the limit, asking not only whether generation is possible but how many examples are required to imitate a target language from formal families. It shows that while generation is computable in principle for any countable family, the resource requirements can be prohibitively large: for regular languages, a finite family of $n$-state DFAs may require as many as $2^{n^k}$ examples, and there exist finite families that are not $2^{n 2^k}$-generatable; for context-free languages, no computable bound exists even for two infinite languages. The results hinge on the size of finite intersections among the languages in the family, connecting generation feasibility to automata-theoretic constructs and complexity. Taken together with practical observations about large language models, the paper argues that explaining empirical success requires a refined framework that ties generation feasibility to the structural properties of natural language and data distributions, while acknowledging open problems around intersections and their impact on learnability and hallucinations.

Abstract

Kleinberg and Mullainathan showed that, in principle, language generation is always possible: with sufficiently many positive examples, a learner can eventually produce sentences indistinguishable from those of a target language. However, the existence of such a guarantee does not speak to its practical feasibility. In this work, we show that even for simple and well-studied language families -- such as regular and context-free languages -- the number of examples required for successful generation can be extraordinarily large, and in some cases not bounded by any computable function. These results reveal a substantial gap between theoretical possibility and efficient learnability. They suggest that explaining the empirical success of modern language models requires a refined perspective -- one that takes into account structural properties of natural language that make effective generation possible in practice.

Paper Structure

This paper contains 11 sections, 5 theorems, 8 equations.

Key Result

Proposition 1

A set $\mathcal{F}$ of infinite languages is not $m$-generatable if and only if there exists a non-empty $\mathcal{S}\subseteq \mathcal{F}$ such that $\bigcap_{L\in\mathcal{S}} L$ is finite but has size at least $m$.
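As a small worked illustration of the proposition, consider the following two-language family. It is a hypothetical example, not taken from the paper: the languages $L_1$ and $L_2$ are sets of natural numbers chosen for convenience, and $m \geq 1$ is arbitrary. The precise definition of $m$-generatable is not reproduced in this summary, so the example relies only on the intersection criterion stated above.

```latex
% Illustrative family (an assumption, not from the paper):
% languages are sets of natural numbers, m >= 1 is fixed.
\begin{align*}
L_1 &= \{1,\dots,m\} \cup \{\, n > m : n \text{ even} \,\}, \\
L_2 &= \{1,\dots,m\} \cup \{\, n > m : n \text{ odd} \,\}, \\
L_1 \cap L_2 &= \{1,\dots,m\}, \qquad |L_1 \cap L_2| = m < \infty .
\end{align*}
```

Both languages are infinite, and taking $\mathcal{S} = \{L_1, L_2\}$ gives a finite intersection of size exactly $m$, so by the proposition $\mathcal{F} = \{L_1, L_2\}$ is not $m$-generatable. The only other non-empty subfamilies are the singletons $\{L_1\}$ and $\{L_2\}$, whose intersections are infinite, so no subfamily has a finite intersection of size at least $m+1$; by the same criterion, $\mathcal{F}$ is $(m+1)$-generatable.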

Theorems & Definitions (11)

  • Definition 1: Language generation in the limit
  • Definition 2: Uniform generation
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • Theorem 2
  • proof
  • Lemma 1
  • ...and 1 more