Table of Contents
Fetching ...

Banach density of generated languages: Dichotomies in topology and dimension

Jon Kleinberg, Fan Wei

Abstract

The formalism of language generation in the limit studies generative models by requiring an algorithm, given strings from a hidden true language, to eventually generate new valid strings. A core issue is the tension between validity and breadth. Prior work quantified breadth via asymptotic density, where the priority is generating strings early in a natural countable ordering. Here, we study density when the strings are embedded in $d$ dimensions, a ubiquitous structure in current generative models. Our goal is for the generated strings to be dense throughout the embedding. This requires a different measure, the Banach density, which captures whether a set contains large sparse regions. Using Banach density uncovers a rich structure based on dimension and the topology of the language collection. We prove that in dimension one, when the underlying topological space has finite Cantor-Bendixson rank, an algorithm can always generate a subset of the true language with an optimal lower Banach density of 1/2. However, for collections with infinite Cantor-Bendixson rank, there are cases where no algorithm can achieve any positive lower Banach density; the generated set must contain arbitrarily large, sparse regions. This reveals a topological contrast unseen with asymptotic density, where 1/2 is always achievable. We also extend our results to a family of measures interpolating between Banach and asymptotic density. Finally, in dimension $d \geq 2$, our positive result for Banach density encounters a Ramsey-theoretic obstacle regarding two-colored point sets. Overcoming this requires a nondegeneracy condition: the embedding of the true language must be sufficiently represented throughout the full $d$-dimensional space.

Banach density of generated languages: Dichotomies in topology and dimension

Abstract

The formalism of language generation in the limit studies generative models by requiring an algorithm, given strings from a hidden true language, to eventually generate new valid strings. A core issue is the tension between validity and breadth. Prior work quantified breadth via asymptotic density, where the priority is generating strings early in a natural countable ordering. Here, we study density when the strings are embedded in dimensions, a ubiquitous structure in current generative models. Our goal is for the generated strings to be dense throughout the embedding. This requires a different measure, the Banach density, which captures whether a set contains large sparse regions. Using Banach density uncovers a rich structure based on dimension and the topology of the language collection. We prove that in dimension one, when the underlying topological space has finite Cantor-Bendixson rank, an algorithm can always generate a subset of the true language with an optimal lower Banach density of 1/2. However, for collections with infinite Cantor-Bendixson rank, there are cases where no algorithm can achieve any positive lower Banach density; the generated set must contain arbitrarily large, sparse regions. This reveals a topological contrast unseen with asymptotic density, where 1/2 is always achievable. We also extend our results to a family of measures interpolating between Banach and asymptotic density. Finally, in dimension , our positive result for Banach density encounters a Ramsey-theoretic obstacle regarding two-colored point sets. Overcoming this requires a nondegeneracy condition: the embedding of the true language must be sufficiently represented throughout the full -dimensional space.

Paper Structure

This paper contains 40 sections, 30 theorems, 157 equations, 2 figures, 1 table.

Key Result

Theorem 1.1

There exists a countable collection of languages ${\mathcal{X}}$ such that any algorithm that generates in the limit will produce a generated set $O$ with lower Banach density $\delta_B(O,K) = 0$. $\blacktriangleleft$$\blacktriangleleft$

Figures (2)

  • Figure 1: Graph representing Example \ref{['example:1']}, where edges between adjacent CB ranks indicate set inclusion
  • Figure 2: One possibility of a spanning tree in the the Cantor-Bendixson directed graph of Example \ref{['example:1']}

Theorems & Definitions (89)

  • Theorem 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Definition 2.1
  • Definition 2.2
  • Claim 2.3: Monotonicity
  • Claim 2.5
  • Definition 2.1: infinite perfect tower
  • Claim 2.6
  • ...and 79 more