Pelican Soup Framework: A Theoretical Framework for Language Model Capabilities

Ting-Rui Chiang; Dani Yogatama

TL;DR

The paper investigates why pretraining enables large language models to follow prompts and perform in-context learning (ICL), proposing the Pelican Soup framework as a minimal theoretical model grounded in consistency and an expression-meaning association. It formalizes tasks via a knowledge base (KB) and a finite set of atom concepts, derives a bound on the average ICL loss with an $\mathcal{O}(1/T)$ convergence rate, and links this bound to description length under additional assumptions. The authors validate the framework with synthetic experiments on the Calcutec dataset and with real-world pronoun-based prompting, demonstrating ICL emergence, generalization under distribution shifts, and instruction-following capabilities, including multi-step reasoning. The work provides a conceptual bridge between linguistic and psychological theories and empirical ICL phenomena, offering guidance for pretraining design and for future research into robust instruction following and generalization in LLMs.

Abstract

In this work, we propose a simple theoretical framework, Pelican Soup, aiming to better understand how pretraining allows LLMs to (1) generalize to unseen instructions and (2) perform in-context learning, even when the verbalizers are irrelevant to the task. To this end, our framework introduces the notions of "knowledge base" and "reference-sense association", along with a simple formalism for natural language processing tasks. Our framework demonstrates how studies in linguistics, psychology, and philosophy can inform our understanding of language models, and it is connected to several existing theoretical results. As an illustration of its usage, we derive a bound on in-context learning loss. Finally, we support our framework with empirical experiments and provide possible future research directions.

Paper Structure (53 sections, 3 theorems, 12 equations, 9 figures, 6 tables)

Introduction
The Pelican Soup Framework
Motivation
Training Data Distribution
A Formalism for NLP Tasks
Bounding ICL Loss
Generalization
Relating to Description Length
Inspecting Generalization Empirically
Inspecting the ICL Capability
Calcutec
Setup
Training Dataset.
Downstream Tasks.
Demonstration.
...and 38 more sections

Key Result

Theorem 4.1

Denote a sequence of input-output pairs as $S_t = x_1, r_1, d, x_2, r_2, \cdots, x_t, r_t, d$, where $r_i$ is the correct verbalizer with which the label of $x_i$ is associated for $i = 1, 2, \cdots, t$ and $d$ is the delimiter that separates the examples. Let the description of a task that maps inp
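The layout of $S_t$ in the theorem statement can be illustrated with a minimal sketch. It only shows how inputs $x_i$, verbalizers $r_i$, and the delimiter $d$ are interleaved; the function name `build_icl_sequence` and the placeholder string tokens are assumptions made for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def build_icl_sequence(pairs: Sequence[Tuple[str, str]], delimiter: str) -> List[str]:
    """Flatten (x_i, r_i) pairs into x_1, r_1, d, x_2, r_2, d, ..., x_t, r_t, d."""
    sequence: List[str] = []
    for x_i, r_i in pairs:
        # Each demonstration contributes its input, its verbalizer, then the delimiter.
        sequence.extend([x_i, r_i, delimiter])
    return sequence


# Hypothetical placeholder tokens standing in for real inputs and verbalizers.
demo = build_icl_sequence([("x1", "r1"), ("x2", "r2")], delimiter="d")
print(demo)
```

With $t$ demonstrations the resulting sequence has $3t$ elements, and every demonstration, including the last one, is followed by the delimiter.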

Figures (9)

Figure 1: Calcutec examples for training, in-context learning (ICL), and chain-of-thought (CoT).
Figure 2: In-context learning accuracy with Calcutec when using different verbalizers ($r_1, r_2$ or $r_3, r_4$). The dotted lines represent the performance of unseen combinations described in §\ref{sec:inspecting-dist-shifts}. The colors represent the number of atom concepts each class ($v_+$ or $v_-$) is associated with. The main lines represent the average accuracy over 5 tasks. Lines in the lighter color represent the individual tasks.
Figure 3: The distribution of lengths and the first step in each paragraph where $z$ is the consequence in the Calcutec dataset. The first and second rows show the statistics before and after some steps are randomly dropped.
Figure 4: Proof tree examples.
Figure 5: In-context learning accuracy with Calcutec when using different verbalizers ($y_1, y_2$ or $y_3, y_4$) and input lengths (3 or 4). The dotted lines represent the performance of unseen combinations described in §\ref{sec:inspecting-dist-shifts}, while the different colors represent the number of formulas each class ($v_+$ or $v_-$) is associated with. The main lines represent the average accuracy over 5 tasks. We plot the performance of each task in lighter colors.
...and 4 more figures

Theorems & Definitions (5)

Definition 3.5: Perfect LM
Theorem 4.1: Average ICL Likelihood
Corollary 4.2: Expected Average ICL Loss
Theorem C.1: Average ICL Loss for Generation
proof
