From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When

Kevin Christian Wibisono, Yixin Wang

TL;DR

This work examines what enables ICL in models trained on unstructured data, and identifies two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions.

Abstract

Large language models (LLMs) like transformers demonstrate impressive in-context learning (ICL) capabilities, allowing them to make predictions for new tasks based on prompt exemplars without parameter updates. While existing ICL theories often assume structured training data resembling ICL tasks (e.g., x-y pairs for linear regression), LLMs are typically trained unsupervised on unstructured text, such as web content, which lacks clear parallels to tasks like word analogy. To address this gap, we examine what enables ICL in models trained on unstructured data, focusing on critical sequence model requirements and training data structure. We find that many ICL capabilities can emerge simply from co-occurrence of semantically related word pairs in unstructured data; word analogy completion, for example, can provably arise purely through co-occurrence modeling, using classical language models like continuous bag of words (CBOW), without needing positional information or attention mechanisms. However, positional information becomes crucial for logic reasoning tasks requiring generalization to unseen tokens. Finally, we identify two cases where ICL fails: one in logic reasoning tasks that require generalizing to new, unseen patterns, and another in analogy completion where relevant word pairs appear only in fixed training positions. These findings suggest that LLMs' ICL abilities depend heavily on the structural elements within their training data.

Paper Structure (28 sections, 8 theorems, 26 equations, 2 figures, 9 tables)

Key Result

Theorem 1

Let $K, L \geq S \geq 3$. Suppose each training sentence of length $S$ is generated by selecting one $(c_i, d_i)$ pair and $S - 2$ distinct $r_i$'s uniformly at random. We train a CBOW model with the squared loss and a sufficiently large embedding dimension on these sentences. Given a prompt $c_{i_1
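The data-generating process in the theorem can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names `make_sentence` and `cooccurrence`, the string labels `c0`, `d0`, `r0`, the reading of $K$ as the number of $(c_i, d_i)$ pairs and $L$ as the number of nuisance tokens $r_i$, and the random shuffling of token order are all assumptions made here for illustration. It only builds the sentences and counts co-occurrences; it does not train a CBOW model.

```python
import random
from collections import Counter
from itertools import combinations


def make_sentence(K, L, S, rng):
    """Build one sentence: one (c_i, d_i) pair plus S - 2 distinct nuisance tokens.

    Assumption: K is the number of pairs and L the number of nuisance tokens.
    """
    assert K >= S >= 3 and L >= S
    i = rng.randrange(K)                    # uniformly chosen pair index
    nuisance = rng.sample(range(L), S - 2)  # distinct r_j's, drawn without replacement
    tokens = [f"c{i}", f"d{i}"] + [f"r{j}" for j in nuisance]
    rng.shuffle(tokens)  # assumption: token order is irrelevant to a bag-of-words model
    return tokens


def cooccurrence(sentences):
    """Count how often each unordered token pair appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(sentence), 2):
            counts[(a, b)] += 1
    return counts


rng = random.Random(0)
sentences = [make_sentence(K=5, L=6, S=4, rng=rng) for _ in range(2000)]
counts = cooccurrence(sentences)
```

Under these assumptions, `counts[("c0", "d0")]` is positive while `counts[("c0", "d1")]` is zero, since each sentence holds exactly one pair. This is the kind of co-occurrence signal that a position-agnostic model such as CBOW has access to.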

Figures (2)

  • Figure 1: This paper identifies essential components for in-context learning (ICL) from pre-training on unstructured natural language data. Left sub-panels, right sub-panels, and boxed letters denote NLP examples, our abstractions, and expected outputs, respectively. Section \ref{sec:co-occ} shows that ICL for word analogy tasks can arise via modeling co-occurrence information using classical language models like continuous bag of words (CBOW) (violet represents relationship-specific nuisance tokens). Section \ref{sec:icl-pos} establishes the necessity of modeling positional information and blocked nuisance structure for ICL tasks, enabling pattern recognition and generalization to novel tokens (violet represents nuisance tokens). Section \ref{sec:failed} presents scenarios where ICL fails, providing theoretical explanations that underscore the critical role of training data structure in enabling ICL in language models.
  • Figure 2: One-layer models fail to differentiate the two patterns in Section \ref{sec:two-patterns}, as evidenced by the accuracy trajectory graph on the left. On the other hand, five-layer models are capable of doing so.

Theorems & Definitions (16)

  • Theorem 1: ICL on single-relationship word analogy tasks
  • Theorem 2: Task selection in CBOW
  • Theorem 3: Necessity of modeling positions
  • Proposition 4: Multi-layer models can encode positions
  • Theorem 5: Blocked nuisance token structure facilitates ICL
  • Theorem 6: Failure of ICL: Different repeated patterns
  • Theorem 7: Failure of ICL: Different pattern structures
  • proof
  • Lemma 8
  • proof
  • ...and 6 more