Formal Aspects of Language Modeling

Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, Li Du

TL;DR

The notes formalize language modeling from measure-theoretic first principles through to practical, representation-based implementations. They contrast globally and locally normalized models and introduce tightness, the property that a model places all of its probability mass on finite strings; a non-tight model leaks mass to infinite sequences and therefore does not define a distribution over finite strings. The representation-based language modeling (RB-LM) framework combines vector representations with a projection onto the probability simplex to define conditional next-token distributions, linking representation learning to formal probabilistic structure. Finite-state models (WFSAs and PFSAs) are connected to n-gram models and subregular languages, providing computationally tractable baselines and insight into modern neural architectures. Overall, the notes give a rigorous foundation for analyzing language models, their normalization, and the trade-offs between expressiveness, tractability, and training objectives.
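To make the notion of tightness concrete, here is a minimal Python sketch. It assumes a toy locally normalized model in which the probability of emitting EOS depends only on the step index `t`; the function name `finite_string_mass` and both EOS schedules are hypothetical illustrations, not taken from the notes. The sketch accumulates the probability of terminating within `max_len` steps, which approximates the total mass placed on finite strings.

```python
def finite_string_mass(eos_prob, max_len):
    """Probability that generation terminates within max_len steps.

    eos_prob(t) is the (assumed) probability of emitting EOS at step t,
    given that no EOS was emitted before step t.
    """
    survive = 1.0  # probability that no EOS has been emitted so far
    mass = 0.0     # accumulated probability of finite strings
    for t in range(1, max_len + 1):
        e = eos_prob(t)
        mass += survive * e
        survive *= 1.0 - e
    return mass


# Constant EOS probability: the terminating mass approaches 1 (tight).
tight_mass = finite_string_mass(lambda t: 0.5, 200)

# EOS probability decaying like 1/(t+1)^2: the terminating mass
# converges to 1/2, so half of the mass leaks to infinite sequences.
leaky_mass = finite_string_mass(lambda t: 1.0 / (t + 1) ** 2, 200_000)

print(tight_mass, leaky_mass)
```

In the second schedule the survival probability is the telescoping product of $(1 - 1/n^2)$ for $n \geq 2$, which converges to $1/2$ rather than $0$; this is why the model in this sketch is not tight.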

Abstract

Large language models have become one of the most commonly deployed NLP inventions. In the past half-decade, their integration into core natural language processing tools has dramatically increased the performance of such tools, and they have entered the public discourse surrounding artificial intelligence. Consequently, it is important for both developers and researchers alike to understand the mathematical foundations of large language models, as well as how to implement them. These notes are the accompaniment to the theoretical portion of the ETH Zürich course on large language models, covering what constitutes a language model from a formal, theoretical perspective.


Paper Structure

This paper contains 211 sections, 85 theorems, 371 equations, 37 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1 (Normalizable energy functions induce language models)

Any normalizable energy function $p_{\text{GN}}$ induces a language model, i.e., a distribution over $\Sigma^*$.
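As an illustration of the statement above, the following Python sketch normalizes a simple energy function over strings. It assumes the common convention that a string's probability is proportional to `exp(-energy)`; the excerpt does not spell out the normalization, so treat this as an assumption. The length-based energy `lam * len(y)` and the helper names `normalizer` and `string_prob` are hypothetical. With an alphabet of size 2 and `lam = log 4`, the normalizer is a geometric series with ratio 1/2, so it is finite and the energy function is normalizable.

```python
import math


def normalizer(alphabet_size, lam, max_len):
    """Truncated normalizer: sum over all strings y with len(y) <= max_len
    of exp(-lam * len(y)). There are alphabet_size**n strings of length n."""
    return sum(
        (alphabet_size ** n) * math.exp(-lam * n) for n in range(max_len + 1)
    )


def string_prob(length, lam, z):
    """Probability of any single string of the given length (assumed form)."""
    return math.exp(-lam * length) / z


alphabet_size = 2
lam = math.log(4.0)  # each extra symbol multiplies the weight by 1/4

z = normalizer(alphabet_size, lam, max_len=60)  # geometric series, ratio 1/2

# Total mass over all strings up to length 60 under the normalized model.
total = sum(
    (alphabet_size ** n) * string_prob(n, lam, z) for n in range(61)
)

print(z, string_prob(0, lam, z), total)
```

If `lam` were at most `log(alphabet_size)`, the series would diverge and the energy function in this sketch would not be normalizable.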

Figures (37)

  • Figure 1: Graphical depiction of the possibly finite coin toss model. The final weight $\frac{1}{2}$ of state $2$ corresponds to the probability $p(\textsc{eos} \mid y_{t-1} = \texttt{T}) = \frac{1}{2}$.
  • Figure 2: "Examples" of a locally and a globally normalized language model.
  • Figure 3: Tight and non-tight bigram models, expressed as Mealy machines. Symbols with conditional probability of 0 are omitted.
  • Figure 4: Outline of the measure-theoretic treatment of LNMs in this section, which arrives at a precise characterization of $p_{\text{LN}}$. The final box corresponds to the sequence model (a probability measure over $\Sigma^* \cup \Sigma^\infty$) constructed for $p_{\text{LN}}$.
  • Figure 5: Example of a simple FSA.
  • ...and 32 more figures

Theorems & Definitions (307)
