Transformers learn variable-order Markov chains in-context

Ruida Zhou; Chao Tian; Suhas Diggavi

TL;DR

This work studies in-context learning (ICL) of variable-order Markov chains (VOMCs) by viewing language modeling as a form of data compression, focusing on small alphabets and low-order VOMCs. It implements synthetic transformer layers and shows that the resulting hybrid transformers can match the ICL performance of transformers; some of them perform even better despite their much-reduced parameter sets.

Abstract

Large language models have demonstrated impressive in-context learning (ICL) capability. However, it is still unclear how the underlying transformers accomplish it, especially in more complex scenarios. Toward this goal, several recent works studied how transformers learn fixed-order Markov chains (FOMC) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMC), i.e., context trees (CTs). In this work, we study the ICL of VOMC by viewing language modeling as a form of data compression and focus on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms, such as context-tree weighting (CTW) and prediction by partial matching (PPM) algorithms as baselines, the former of which is Bayesian optimal for a class of CTW priors. We empirically observe a few phenomena: 1) Transformers can indeed learn to compress VOMC in-context, while PPM suffers significantly; 2) The performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in-context quite well; and 3) Transformers trained and tested on non-CTW priors can significantly outperform the CTW algorithm. To explain these phenomena, we analyze the attention map of the transformers and extract two mechanisms, on which we provide two transformer constructions: 1) A construction with $D+2$ layers that can mimic the CTW algorithm accurately for CTs of maximum order $D$, 2) A 2-layer transformer that utilizes the feed-forward network for probability blending. One distinction from the FOMC setting is that a counting mechanism appears to play an important role. We implement these synthetic transformer layers and show that such hybrid transformers can match the ICL performance of transformers, and more interestingly, some of them can perform even better despite the much-reduced parameter sets.

Paper Structure (18 sections, 5 theorems, 34 equations, 11 figures, 2 tables)

Introduction
Related works
Preliminaries
The Transformer Model
In-context learning as Bayesian universal coding
Context Tree Models (Variable-Order Markov Chains)
Bayesian Context Tree Weighting Compression Algorithm
Prediction by Partial Matching
Transformers Learn In-context of VOMCs
Transformers Can Learn VOMC In-Context
Transformers vs. CTW under Non-CTW-Priors
Theoretical Interpretations and Empirical Evidences
Analysis of Attention Maps
Capability and capacity of transformer via construction
A representation of CTW optimal next token prediction
...and 3 more sections

Key Result

Theorem 2.1

[kontoyiannis2022bayesian] The $p^{w}_{n,()}$ value at the root computed by the CTW procedure equals the Bayesian predicted probability under the prior $\pi_{\text{CTW}}$ specified by $(D, \lambda, \boldsymbol{\alpha})$.
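To make the statement concrete, the following is a minimal sketch of a textbook-style CTW recursion, not the paper's implementation. It rests on several assumptions: `lam` is taken to be the weight placed on stopping at a node and `alpha` a symmetric Dirichlet pseudo-count (the paper's exact parameterization of $\lambda$ and $\boldsymbol{\alpha}$ may differ), the first `D` symbols are treated as a given initial context, and the function names `ctw_log_prob` and `ctw_predict` are hypothetical. The next-symbol prediction is obtained as a ratio of root weighted probabilities.

```python
import math
from collections import defaultdict


def ctw_log_prob(seq, alphabet, D, lam=0.5, alpha=0.5):
    """Log of the weighted probability at the root for seq[D:], given seq[:D].

    Assumed parameterization: lam is the weight on stopping at a node and
    alpha is a symmetric Dirichlet pseudo-count.
    """
    m = len(alphabet)
    idx = {a: i for i, a in enumerate(alphabet)}
    # counts[s][j]: how often symbol j followed context s (oldest symbol first).
    counts = defaultdict(lambda: [0] * m)
    for t in range(D, len(seq)):
        for d in range(D + 1):
            counts[tuple(seq[t - d:t])][idx[seq[t]]] += 1

    def log_pe(c):
        # Log marginal likelihood of the counts under a Dirichlet(alpha) prior.
        out = math.lgamma(m * alpha) - math.lgamma(m * alpha + sum(c))
        for cj in c:
            out += math.lgamma(alpha + cj) - math.lgamma(alpha)
        return out

    def log_pw(s):
        c = counts.get(s)
        if c is None:
            return 0.0  # unseen context: all probabilities equal 1
        le = log_pe(c)
        if len(s) == D:
            return le  # maximum depth: no further splitting
        # Children extend the context one symbol further into the past.
        lc = sum(log_pw((a,) + s) for a in alphabet)
        x = math.log(lam) + le
        y = math.log(1.0 - lam) + lc
        hi = max(x, y)
        return hi + math.log(math.exp(x - hi) + math.exp(y - hi))

    return log_pw(())


def ctw_predict(seq, alphabet, D, lam=0.5, alpha=0.5):
    """Next-symbol distribution as a ratio of root weighted probabilities."""
    if len(seq) < D:
        raise ValueError("need at least D symbols of initial context")
    base = ctw_log_prob(seq, alphabet, D, lam, alpha)
    return {
        a: math.exp(ctw_log_prob(list(seq) + [a], alphabet, D, lam, alpha) - base)
        for a in alphabet
    }
```

For example, `ctw_predict("abacabab", "abc", D=2)` returns a dictionary of next-symbol probabilities over `a`, `b`, `c`. With `D=0` the recursion reduces to a single Dirichlet estimator at the root. This sketch recomputes all counts on every call for clarity; a practical implementation would update them incrementally along the active path.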

Figures (11)

Figure 1: Transformer model
Figure 2: ICL vs. Bayesian Universal Coding
Figure 3: A CT in the alphabet $\mathcal{A}=\{a,b,c\}$ with suffix set $\mathcal{S}=\{(b),(c),(a,a),(b,a),(c,a)\}$ and the associated probability distributions. If $(\ldots,x_{n-1},x_n)=(\ldots,c,a)$, then the probability distribution for the next symbol $x_{n+1}$ is $p_{c,a}$.
Figure 4: Training data collection
Figure 5: Transformer, PPM, CTW
...and 6 more figures
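The context-tree lookup described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch only: the suffix set is taken from the caption, but the numerical distributions in `LEAF_DISTS` are invented placeholders, and the names `active_suffix` and `next_symbol_dist` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Suffix = Tuple[str, ...]

# Suffix set from the Figure 3 caption, each suffix written oldest symbol first.
SUFFIXES = [("b",), ("c",), ("a", "a"), ("b", "a"), ("c", "a")]

# Placeholder next-symbol distributions over {a, b, c}.
# These numbers are made up for illustration; they are not from the paper.
LEAF_DISTS: Dict[Suffix, Dict[str, float]] = {
    ("b",): {"a": 0.2, "b": 0.5, "c": 0.3},
    ("c",): {"a": 0.6, "b": 0.1, "c": 0.3},
    ("a", "a"): {"a": 0.1, "b": 0.8, "c": 0.1},
    ("b", "a"): {"a": 0.3, "b": 0.3, "c": 0.4},
    ("c", "a"): {"a": 0.7, "b": 0.2, "c": 0.1},
}


def active_suffix(history: Sequence[str]) -> Suffix:
    """Return the suffix in SUFFIXES that matches the end of the history."""
    for s in SUFFIXES:
        k = len(s)
        if len(history) >= k and tuple(history[-k:]) == s:
            return s
    raise ValueError("history too short to determine a leaf")


def next_symbol_dist(history: Sequence[str]) -> Dict[str, float]:
    """Distribution of the next symbol given the history."""
    return LEAF_DISTS[active_suffix(history)]
```

With the history ending in `(c, a)`, `active_suffix` selects the leaf `("c", "a")`, matching the caption's example; a history ending in `b` selects `("b",)` regardless of what came before, which is what makes the order variable.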

Theorems & Definitions (9)

Theorem 2.1
Theorem 4.1
Proof of Theorem 4.1
Theorem 4.2
Proof of Theorem 4.2
Theorem 4.3
Proof of Theorem 4.3
Theorem 4.4
Proof of Theorem 4.4
