Local Grammar-Based Coding Revisited

Łukasz Dębowski

TL;DR

The paper studies minimal local grammar-based coding, showing universal properties and links to power-law linguistics. It uses a harmonic bound on ranked probabilities $\pi_n \le 1/n$ to strengthen universality proofs and develops bounds relating vocabulary size to mutual information and redundancy. It extends the framework to infinite vocabularies via wrapped codes and rank-list constructions, and demonstrates universality for both infinite and trimmed/fixed-vocabulary schemes. Overall, the results bolster connections among Zipf's law, Heaps' law, and Hilberg's law and suggest principled tokenization strategies for universal coding with potential relevance to language models.

Abstract

In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined via simple symbol-by-symbol encoding. This paper makes four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes whose finite vocabularies are empirical rank lists, proving that such codes are also universal. These results extend the foundations of universal grammar-based coding and reaffirm previously stated connections to power laws for human language and language models.
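
The harmonic bound on ranked probabilities mentioned above can be checked numerically. If probabilities are sorted in non-increasing order, $\pi_1 \ge \pi_2 \ge \dots$, then $n\,\pi_n \le \sum_{i \le n} \pi_i \le 1$, so $\pi_n \le 1/n$. The following is a minimal sketch of that check in Python; the function names and the random test distributions are illustrative assumptions and are not taken from the paper.

```python
import random


def ranked(probs):
    """Return the probabilities sorted in non-increasing order (rank 1 first)."""
    return sorted(probs, reverse=True)


def satisfies_harmonic_bound(probs, tol=1e-12):
    """Check pi_n <= 1/n at every rank n, up to a small floating-point tolerance."""
    return all(p <= 1.0 / n + tol for n, p in enumerate(ranked(probs), start=1))


def random_distribution(size, rng):
    """Draw an arbitrary distribution on `size` outcomes (illustrative only)."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)

# Any genuine probability distribution must pass the check.
checks = [satisfies_harmonic_bound(random_distribution(50, rng)) for _ in range(100)]

# The uniform distribution attains the bound with equality at the last rank.
uniform_ok = satisfies_harmonic_bound([0.25] * 4)
```

The bound needs only that the values are non-negative and sum to at most one, so it should apply equally to incomplete distributions.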

Paper Structure (17 sections, 24 theorems, 59 equations)


Key Result

Theorem 1

An incomplete distribution $Q$ is universal if for any $k\ge 1$, any conditional probability distribution $\pi:\mathbb{X}\times\mathbb{X}^k\to[0,1]$, and any string $x_1^n\in\mathbb{X}^*$, we have [displayed inequality missing in the source], where $\lim_{k\to\infty} \limsup_{n\to\infty} C(n,k)/n=0$.

Theorems & Definitions (42)

  • Definition 1: straight-line grammar
  • Definition 2: grammar expansion
  • Definition 3: finite grammar
  • Definition 4: flat grammar
  • Definition 5: block grammar
  • Definition 6
  • Definition 7: minimal grammar transform
  • Definition 8: local grammar encoder
  • Definition 9: minimal code
  • Definition 10: proper code
  • ...and 32 more
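
Definitions 1 and 2 name straight-line grammars and grammar expansion, but their formal statements are not reproduced here. As a rough orientation only, the sketch below assumes the common notion of a straight-line grammar: every nonterminal has exactly one rule and the rules are non-recursive, so the start symbol expands to exactly one string. The dictionary representation, the `expand` helper, and the toy rules are hypothetical and may differ from the paper's formalism.

```python
def expand(rules, symbol, cache=None):
    """Expand `symbol` into the single terminal string it derives.

    `rules` maps each nonterminal to the list of symbols on its right-hand
    side; any symbol without a rule is treated as a terminal.
    Assumes the grammar is non-recursive (straight-line).
    """
    if cache is None:
        cache = {}
    if symbol not in rules:  # terminal symbol: expands to itself
        return symbol
    if symbol not in cache:  # memoize so each rule is expanded once
        cache[symbol] = "".join(expand(rules, s, cache) for s in rules[symbol])
    return cache[symbol]


# Hypothetical toy grammar: S -> A A A A, A -> a b c
rules = {
    "S": ["A", "A", "A", "A"],
    "A": ["a", "b", "c"],
}

text = expand(rules, "S")
```

Here the repeated block `abc` is stored once under `A` and referenced four times from `S`, which is the basic mechanism by which a grammar can represent a repetitive string compactly.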