Position: Understanding LLMs Requires More Than Statistical Generalization

Patrik Reizinger; Szilvia Ujváry; Anna Mészáros; Anna Kerekes; Wieland Brendel; Ferenc Huszár

TL;DR

The paper argues that understanding large language models (LLMs) requires looking beyond statistical generalization to the saturation regime, where multiple models attain the same minimal test loss yet differ in practical capabilities. It introduces three notions of non-identifiability (functional non-identifiability, $\varepsilon$-non-identifiability, and parameter non-identifiability) and supports them with three case studies on rule extrapolation, in-context learning, and fine-tuning. The authors propose focusing on inductive biases and qualitative generalization measures, and outline directions involving formal languages, computational language modeling, and mechanistic interpretability to study transfer and extrapolation. This perspective aims to establish principled, task-specific notions of generalization and transfer that capture LLM behavior better than traditional loss-based metrics.

Abstract

The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that autoregressive (AR) probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart (and thus with equivalent test loss) can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
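The core observation of the abstract can be made concrete with a minimal sketch. The toy setup below is an assumption made for illustration and is not taken from the paper: sequences have length two over the tokens `a` and `b`, the data distribution puts all its mass on `ab`, and two hypothetical AR models (`M1`, `M2`) agree with it on every prefix that has non-zero probability. Both models are at zero KL divergence from the data, yet they complete the zero-probability prompt `b` differently.

```python
import itertools
import math

# Hypothetical toy AR models: each maps a prefix to a next-token distribution.
# The data distribution puts all its mass on the single sequence "ab".
DATA = {"": {"a": 1.0, "b": 0.0}, "a": {"a": 0.0, "b": 1.0}, "b": {"a": 0.5, "b": 0.5}}
# M1 and M2 match DATA on the reachable prefixes "" and "a",
# and differ only on the prefix "b", which has probability zero under DATA.
M1 = {"": {"a": 1.0, "b": 0.0}, "a": {"a": 0.0, "b": 1.0}, "b": {"a": 1.0, "b": 0.0}}
M2 = {"": {"a": 1.0, "b": 0.0}, "a": {"a": 0.0, "b": 1.0}, "b": {"a": 0.0, "b": 1.0}}


def seq_prob(model, seq):
    """Probability of a full sequence under the AR factorization."""
    prob = 1.0
    for i, tok in enumerate(seq):
        prob *= model[seq[:i]][tok]
    return prob


def kl(p, q, length=2):
    """KL(p || q) over all sequences of the given length (terms with p(x) = 0 contribute 0)."""
    total = 0.0
    for toks in itertools.product("ab", repeat=length):
        seq = "".join(toks)
        px = seq_prob(p, seq)
        if px > 0:
            total += px * math.log(px / seq_prob(q, seq))
    return total


def greedy_next(model, prefix):
    """Most likely next token after the prefix."""
    return max(model[prefix], key=model[prefix].get)


print(kl(DATA, M1), kl(DATA, M2))                  # both models are at zero KL from the data
print(greedy_next(M1, "b"), greedy_next(M2, "b"))  # but they complete the prompt "b" differently
```

Because the prompt `b` never occurs under the data distribution, no amount of test data drawn from it can distinguish `M1` from `M2`; which completion a trained model produces is left to its inductive biases.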

Paper Structure (26 sections, 2 theorems, 32 equations, 5 figures, 4 tables)

Introduction
Background
Identifiability in
Case Study: Non-Identifiability of rule extrapolation
Case Study: $\varepsilon$-non-identifiability and ICL
Case Study: Parameter Non-identifiability and Fine-tuning
The saturation regime
Discussion: where next?
Better generalization measures
Computational language modeling to study transferability
Inductive biases for understanding
Conclusion
Details on $\varepsilon$-identifiability and the saturation regime
Problem framework and notations
Proof of Proposition 3.1
...and 11 more sections

Key Result

Proposition 3.1

($\varepsilon$-non-identifiability of ICL) Let $N$ be the length of a prompt $(S_n, x_{\text{test}})$. For all $\varepsilon > 0$, there exists $n_1 \geq n_0$, such that for all $n \geq n_1$, there exists a distribution $q_n$ close to a mixture of $p$ in KL divergence
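To see why closeness in KL divergence leaves room for very different behaviour, a generic mixture bound is a useful warm-up. This is a standard calculation offered as an illustration, not the construction used in the paper's proof; the auxiliary distribution $r$ and the mixing weight $\delta$ are assumptions introduced here. For any distribution $r$ and any $\delta \in (0, 1)$, define $q_\delta = (1 - \delta)\, p + \delta\, r$. Since $q_\delta \geq (1 - \delta)\, p$ pointwise,

$$
\mathrm{KL}(p \,\|\, q_\delta) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q_\delta(x)}\right] \leq \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{(1 - \delta)\, p(x)}\right] = -\log(1 - \delta).
$$

Choosing $\delta \leq 1 - e^{-\varepsilon}$ therefore guarantees $\mathrm{KL}(p \,\|\, q_\delta) \leq \varepsilon$ for any choice of $r$, while on inputs to which $p$ assigns little or no mass the behaviour of $q_\delta$ is dominated by $r$.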

Figures (5)

Figure 1: Illustration of the rule extrapolation case study: We train a Transformer on a language consisting of sequences of the form $a^nb^n$. Left: This language can be represented as an intersection of two rules: (Rule 1) the number of $a$s and $b$s match; and (Rule 2) $a$ never follows a $b$. Right: We consider different models (M1, M2) which achieve perfect test loss. On prompts consistent with the $a^nb^n$ grammar (e.g., $aa$), the models produce the same completions. However, on prompts that are inconsistent with $a^nb^n$, and thus have probability zero under the pre-training distribution, the models may produce different completions. For these prompts, we can ask whether the completed prompts still satisfy Rule 1, which we call rule extrapolation. Rule extrapolation behaviour is not implied by minimal test loss, but may arise due to inductive biases.
Figure 2: Rule extrapolation in Transformers is better than chance: We trained a Transformer via maximum likelihood on the $a^nb^n$ language. We evaluated the model on prompts that are inconsistent with $a^nb^n$ and checked whether the completions obey Rule 1 ($x$ axis). Two other models, trained by an adversarial and an oracle process, achieved the same test loss but displayed very different rule extrapolation accuracies. This demonstrates that test loss is insensitive to rule extrapolation behaviour and that the $43.7\%$ rule extrapolation accuracy (averaged over 20 seeds; details in the experimental appendix) results from inductive biases.
Figure 3: Vanishingly small KL divergence cannot capture ICL: illustration of Proposition 3.1, showing that when $p$ displays the ICL property, there exists a distribution $q$ that is $\varepsilon$-close in KL divergence yet has no ICL ability.
Figure 4: Illustration of parameter non-identifiability: Two sets of parameters ($\theta_1, \theta_2$) may describe the same autoregressive (AR) model and thus achieve the same test loss and perform identically in benchmarks. When fine-tuned on the same data, parameter-dependent inductive biases may push the two models apart, and it is possible that, say, $\theta_1$ enables significantly more data-efficient fine-tuning than $\theta_2$.
Figure 5: Illustration of how inductive biases can affect identifiability: In the saturation regime, training can result in different parameters $\theta_1,\theta_2$ with the same training and test loss, but different downstream performance. Even if the loss is insensitive to a model property that is required for good downstream performance, choosing a useful inductive bias can help capture said property, overcoming its non-identifiability.
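The parameter non-identifiability described in the captions for Figures 4 and 5 can be sketched with a toy one-hidden-layer ReLU network. Everything below is a hypothetical illustration rather than the paper's experiment: `theta_1` and `theta_2` are hand-picked parameter sets related by a hidden-unit permutation and a positive rescaling (a well-known symmetry of ReLU networks), so they define the same function. A single gradient step on one made-up data point, standing in for fine-tuning, then moves the two functions apart.

```python
def relu(z):
    return max(z, 0.0)


def forward(theta, x):
    """Scalar network f(x) = sum_j v_j * relu(w_j * x + b_j)."""
    w, b, v = theta
    return sum(vj * relu(wj * x + bj) for wj, bj, vj in zip(w, b, v))


def sgd_step(theta, x, y, lr):
    """One gradient-descent step on the squared error 0.5 * (f(x) - y)**2."""
    w, b, v = theta
    err = forward(theta, x) - y
    new_w, new_b, new_v = [], [], []
    for wj, bj, vj in zip(w, b, v):
        pre = wj * x + bj
        active = 1.0 if pre > 0 else 0.0  # derivative of relu at pre
        new_w.append(wj - lr * err * vj * active * x)
        new_b.append(bj - lr * err * vj * active)
        new_v.append(vj - lr * err * relu(pre))
    return (new_w, new_b, new_v)


# Hypothetical parameters, given as (w, b, v).
theta_1 = ([1.0, -2.0], [0.5, 1.0], [3.0, -1.0])
# Same function: hidden units swapped, and the former first unit rescaled
# by 2 on the input side and by 1/2 on the output side.
theta_2 = ([-2.0, 2.0], [1.0, 1.0], [-1.0, 1.5])

grid = [i * 0.25 for i in range(-8, 9)]
same_before = all(abs(forward(theta_1, x) - forward(theta_2, x)) < 1e-12 for x in grid)

# "Fine-tune" both parameter sets on the same (made-up) example.
tuned_1 = sgd_step(theta_1, x=1.0, y=0.0, lr=0.01)
tuned_2 = sgd_step(theta_2, x=1.0, y=0.0, lr=0.01)
gap_after = abs(forward(tuned_1, 1.0) - forward(tuned_2, 1.0))

print(same_before, gap_after)
```

The two parameter sets are indistinguishable by any loss evaluated on the function's outputs, but gradients are computed in parameter space, so the update depends on which parametrization training happened to find.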

Theorems & Definitions (10)

Definition 3.1: $\varepsilon$-non-identifiability of distributional properties (informal)
Proposition 3.1
Proof (Sketch)
Proposition 3.1
proof
Definition 3.1
Definition 4.1: Set of probability measures
Definition 4.2: Property
Definition 4.3: Property equivalence classes
Definition 4.4: $\varepsilon$-non-identifiability of distributional properties
