From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Kevin Christian Wibisono; Yixin Wang

TL;DR

This work examines what enables ICL in models trained on unstructured data, and identifies two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions.

Abstract

Large language models (LLMs) like transformers demonstrate impressive in-context learning (ICL) capabilities, allowing them to make predictions for new tasks based on prompt exemplars without parameter updates. While existing ICL theories often assume structured training data resembling ICL tasks (e.g., x-y pairs for linear regression), LLMs are typically trained unsupervised on unstructured text, such as web content, which lacks clear parallels to tasks like word analogy. To address this gap, we examine what enables ICL in models trained on unstructured data, focusing on critical sequence model requirements and training data structure. We find that many ICL capabilities can emerge simply from co-occurrence of semantically related word pairs in unstructured data; word analogy completion, for example, can provably arise purely through co-occurrence modeling, using classical language models like continuous bag of words (CBOW), without needing positional information or attention mechanisms. However, positional information becomes crucial for logic reasoning tasks requiring generalization to unseen tokens. Finally, we identify two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions. These findings suggest that LLMs' ICL abilities depend heavily on the structural elements within their training data.

Paper Structure (28 sections, 8 theorems, 26 equations, 2 figures, 9 tables)

Introduction
In-context learning can arise by modeling co-occurrence via CBOW
In-context learning on single-relationship word analogy tasks
In-context learning on dual-connected-relationship word analogy tasks
In-context learning on dual-disjoint-relationship tasks
Experiments on a synthetic corpus
The essential role of positional information in enabling in-context learning
In-context learning on single-pattern tasks
In-context learning on dual-pattern tasks
Scenarios where in-context learning fails
Failed scenario 1: Sentences with repeating patterns
Failed scenario 2: Sentences with co-occurring word pairs restricted to fixed locations
Experiment on a synthetic corpus
Discussion
Limitations and future work
...and 13 more sections

Key Result

Theorem 1

Let $K, L \geq S \geq 3$. Suppose each training sentence of length $S$ is generated by selecting one $(c_i, d_i)$ pair and $S - 2$ distinct $r_i$'s uniformly at random. We train a CBOW model with the squared loss and a sufficiently large embedding dimension on these sentences. Given a prompt $c_{i_1
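The sentence-generation procedure in the theorem's setup can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name `make_corpus`, the token spellings (`c3`, `d3`, `r5`), and the reading of $K$ as the number of $(c_i, d_i)$ pairs and $L$ as the number of $r_i$ tokens are all assumptions. The token order inside a sentence is also an assumption; it is shuffled here because a bag-of-words model such as CBOW does not use positions.

```python
import random


def make_corpus(K, L, S, n_sentences, seed=0):
    """Sample sentences of length S, each holding one (c_i, d_i) pair
    and S - 2 distinct nuisance tokens r_j, all chosen uniformly at random.

    Assumption: K is the number of pairs and L the number of nuisance tokens.
    """
    if not (K >= S >= 3 and L >= S):
        raise ValueError("the setup assumes K, L >= S >= 3")
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_sentences):
        i = rng.randrange(K)                    # one pair index, uniform over K
        nuisance = rng.sample(range(L), S - 2)  # S - 2 distinct nuisance indices
        sentence = [f"c{i}", f"d{i}"] + [f"r{j}" for j in nuisance]
        rng.shuffle(sentence)                   # assumption: order is irrelevant to CBOW
        corpus.append(sentence)
    return corpus


if __name__ == "__main__":
    for sentence in make_corpus(K=5, L=6, S=4, n_sentences=3):
        print(" ".join(sentence))
```

Each sampled sentence contains exactly one related pair, so the only consistent signal a co-occurrence model can pick up is that $c_i$ and $d_i$ appear together, while each $r_j$ co-occurs with every pair equally often on average.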

Figures (2)

Figure 1: This paper identifies essential components for in-context learning (ICL) from pre-training on unstructured natural language data. Left sub-panels, right sub-panels, and boxed letters denote NLP examples, our abstractions, and expected outputs, respectively. Section 2 shows that ICL for word analogy tasks can arise via modeling co-occurrence information using classical language models like continuous bag of words (CBOW) (violet represents relationship-specific nuisance tokens). Section 3 establishes the necessity of modeling positional information and blocked nuisance structure for ICL tasks, enabling pattern recognition and generalization to novel tokens (violet represents nuisance tokens). Section 4 presents scenarios where ICL fails, providing theoretical explanations that underscore the critical role of training data structure in enabling ICL in language models.
Figure 2: One-layer models fail to differentiate the two patterns in Section 3.2 (dual-pattern tasks), as evidenced by the accuracy trajectory graph on the left. On the other hand, five-layer models are capable of doing so.

Theorems & Definitions (16)

Theorem 1: ICL on single-relationship word analogy tasks
Theorem 2: Task selection in CBOW
Theorem 3: Necessity of modeling positions
Proposition 4: Multi-layer models can encode positions
Theorem 5: Blocked nuisance token structure facilitates ICL
Theorem 6: Failure of ICL: Different repeated patterns
Theorem 7: Failure of ICL: Different pattern structures
proof
Lemma 8
proof
...and 6 more
