Formal Aspects of Language Modeling
Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, Li Du
TL;DR
The notes formalize language modeling from measure-theoretic first principles to practical, representation-based implementations. They contrast globally and locally normalized approaches and introduce the concept of tightness: a tight model places all of its probability mass on finite strings, whereas a non-tight model leaks mass to infinite sequences. The representation-based language modeling (RB-LM) framework combines vector representations with a projection onto the probability simplex to define conditional next-token distributions, linking representation learning with formal probabilistic structure. Finite-state models (WFSAs and PFSAs) are connected to n-gram models and subregular theories, providing computationally tractable baselines and insights into modern neural architectures. Overall, the work establishes a rigorous foundation for analyzing language models, their normalization, and the trade-offs between expressiveness, tractability, and training objectives, with implications for both theory and practical model design.
Abstract
Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes accompany the theoretical portion of the ETH Zürich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.
