Formal Aspects of Language Modeling

Ryan Cotterell; Anej Svete; Clara Meister; Tianyu Liu; Li Du

TL;DR

The notes formalize language modeling from measure-theoretic first principles through to practical, representation-based implementations. They contrast globally and locally normalized models and introduce tightness, the property that a model places all of its probability mass on finite strings; a non-tight model leaks some of its mass to infinite sequences and therefore does not define a distribution over finite strings. The representation-based language modeling (RB-LM) framework combines vector representations with a projection onto the probability simplex to define conditional next-token distributions, linking representation learning with formal probabilistic structure. Finite-state models (WFSAs and PFSAs) are connected to n-gram models and subregular language theory, providing computationally tractable baselines and insights into modern neural architectures. Overall, the notes establish a rigorous foundation for analyzing language models, their normalization, and the trade-offs between expressiveness, tractability, and training objectives, with implications for both theory and practical model design.
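To make the notion of tightness concrete, here is a minimal sketch, not taken from the notes, of a locally normalized model over a one-symbol alphabet. At step t the model emits EOS with probability `p_eos(t)` and otherwise continues. The helper name `finite_string_mass`, the parameter `p_eos`, and both example schedules are hypothetical choices made for illustration.

```python
from typing import Callable


def finite_string_mass(p_eos: Callable[[int], float], max_len: int) -> float:
    """Mass a toy locally normalized model puts on strings ending within max_len steps.

    Assumption (illustrative only): the alphabet has a single symbol, so the
    model is fully described by its EOS probability at each step t = 1, 2, ...
    """
    survive = 1.0  # probability that EOS has not been generated so far
    mass = 0.0     # accumulated probability of having terminated
    for t in range(1, max_len + 1):
        stop = p_eos(t)
        mass += survive * stop   # terminate exactly at step t
        survive *= 1.0 - stop    # continue past step t
    return mass


# Constant EOS probability: the survival probability is 2^{-t}, so the mass
# on finite strings approaches 1 and the model is tight.
tight_mass = finite_string_mass(lambda t: 0.5, 200)

# EOS probability 1 / (t + 1)^2: the survival probability converges to 1/2,
# so only about half of the mass lands on finite strings (non-tight).
leaky_mass = finite_string_mass(lambda t: 1.0 / (t + 1) ** 2, 20_000)

print(f"tight: {tight_mass:.6f}, non-tight: {leaky_mass:.6f}")
```

In the second schedule the EOS probabilities shrink fast enough that their sum converges, which is why the model can avoid terminating forever with positive probability.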

Abstract

Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for both developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.

Paper Structure

This paper contains 211 sections, 85 theorems, 371 equations, 37 figures, 3 tables, 2 algorithms.

Introduction
Introduction
Disclaimer.
Probabilistic Foundations
An Invitation to Language Modeling
A Measure-theoretic Foundation
Language Models: Distributions over Strings
Sets of Strings
A note on terminology.
Defining a Language Model
Global and Local Normalization
A note on terminology.
The beginning of sequence string symbol.
Globally Normalized Language Models
Normalizability
...and 196 more sections

Key Result

theorem 1

Normalizable energy functions induce language models. Any normalizable energy function $p_{\text{GN}}$ induces a language model, i.e., a distribution over $\Sigma^*$.
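As a worked illustration of what normalizability asks for, consider the following sketch. It assumes the usual convention that lower energy means higher probability, and it writes $\hat{p}_{\text{GN}}$ for the energy function and $Z$ for the normalizer; the hat, the symbol $Z$, and the particular energy $c\,|y|$ are notation and an example chosen here, not quoted from the notes.

```latex
% Globally normalized model induced by an energy function (assumed convention):
p_{\text{GN}}(y) = \frac{\exp\!\big(-\hat{p}_{\text{GN}}(y)\big)}{Z},
\qquad
Z = \sum_{y' \in \Sigma^*} \exp\!\big(-\hat{p}_{\text{GN}}(y')\big).

% Example energy: \hat{p}_{\text{GN}}(y) = c\,|y| for a constant c > 0.
% There are |\Sigma|^n strings of length n, so the normalizer is a geometric series:
Z = \sum_{n=0}^{\infty} |\Sigma|^{n} e^{-c n}
  = \frac{1}{1 - |\Sigma|\, e^{-c}}
\quad \text{if } c > \log |\Sigma| .

% In that case the induced distribution over \Sigma^* is
p_{\text{GN}}(y) = \big(1 - |\Sigma|\, e^{-c}\big)\, e^{-c\,|y|} .

% If c \le \log |\Sigma| the series diverges, Z = \infty, and the energy
% function is not normalizable, so it induces no distribution over \Sigma^*.
```

The example shows that normalizability is a real restriction: the same functional form does or does not yield a language model depending on how fast the energy grows relative to the number of strings of each length.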

Figures (37)

Figure 1: Graphical depiction of the possibly finite coin toss model. The final weight $\frac{1}{2}$ of the state $2$ corresponds to the probability $p(\textsc{eos} \mid y_{t-1} = \texttt{T}) = \frac{1}{2}$.
Figure 2: "Examples" of a locally and a globally normalized language model.
Figure 3: Tight and non-tight bigram models, expressed as Mealy machines. Symbols with conditional probability of 0 are omitted.
Figure 4: The outline of our measure-theoretic treatment of LNMs in this section to arrive at a precise characterization of $p_{\text{LN}}$. The final box corresponds to the sequence model (probability measure over $\Sigma^* \cup \Sigma^\infty$) constructed for $p_{\text{LN}}$.
Figure 5: Example of a simple FSA.
...and 32 more figures

Theorems & Definitions (307)

definition 1
definition 2
definition 3
definition 4
definition 5
definition 6
definition 7
definition 8
definition 9
definition 10
...and 297 more
