Local Grammar-Based Coding Revisited
Łukasz Dębowski
TL;DR
The paper studies minimal local grammar-based coding, showing universal properties and links to power-law linguistics. It uses a harmonic bound on ranked probabilities $\pi_n \le 1/n$ to strengthen universality proofs and develops bounds relating vocabulary size to mutual information and redundancy. It extends the framework to infinite vocabularies via wrapped codes and rank-list constructions, and demonstrates universality for both infinite and trimmed/fixed-vocabulary schemes. Overall, the results bolster connections among Zipf's law, Heaps' law, and Hilberg's law and suggest principled tokenization strategies for universal coding with potential relevance to language models.
Abstract
In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length, where the length is defined via a simple symbol-by-symbol encoding. This paper presents four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes whose finite vocabularies are empirical rank lists, proving that such codes are also universal. These results extend the foundations of universal grammar-based coding and reaffirm previously stated connections to power laws for human language and language models.
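The harmonic bound mentioned above follows from a one-line argument: if probabilities are sorted in non-increasing order, $\pi_1 \ge \pi_2 \ge \dots$, then $n \pi_n \le \sum_{i \le n} \pi_i \le 1$, hence $\pi_n \le 1/n$. The following is a minimal sketch, not taken from the paper, that checks this bound on empirical token frequencies. The function names `ranked_probabilities` and `satisfies_harmonic_bound` and the toy sentence are illustrative assumptions.

```python
from collections import Counter


def ranked_probabilities(tokens):
    """Return empirical token probabilities sorted in non-increasing order."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def satisfies_harmonic_bound(probs, tol=1e-12):
    """Check pi_n <= 1/n for every rank n (ranks are 1-indexed)."""
    return all(p <= 1.0 / n + tol for n, p in enumerate(probs, start=1))


# Toy input (hypothetical): whitespace tokenization of a short sentence.
tokens = "the cat sat on the mat and the dog sat too".split()
probs = ranked_probabilities(tokens)
print(satisfies_harmonic_bound(probs))
```

The bound holds for any probability distribution once it is ranked, so the check is expected to pass regardless of the input text; it only illustrates why the inequality needs no assumptions about the data.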
