Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts

Anna Mészáros, Szilvia Ujváry, Wieland Brendel, Patrik Reizinger, Ferenc Huszár

TL;DR

This work defines a new scenario of OOD compositional generalization, termed rule extrapolation, and lays the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.

Abstract

LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages of varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures' influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
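
To make the setup concrete, here is a minimal sketch for the $a^nb^n$ language. It assumes, since this page does not spell the rules out, that (R1) requires equal numbers of a's and b's and (R2) requires every a to precede every b. All function names, the example prompts, and the toy completer are hypothetical illustrations, not the authors' code. A prompt containing "ba" violates (R2) irreparably, so it is OOD, and the question is whether the completion still satisfies (R1).

```python
from typing import Callable, Iterable


def r1(seq: str) -> bool:
    """Assumed rule (R1): as many a's as b's."""
    return seq.count("a") == seq.count("b")


def r2(seq: str) -> bool:
    """Assumed rule (R2): no 'a' appears after a 'b'."""
    return "ba" not in seq


def in_language(seq: str) -> bool:
    """In-distribution strings lie in the intersection of both rules."""
    return r1(seq) and r2(seq)


def extrapolation_accuracy(
    prompts: Iterable[str], complete: Callable[[str], str]
) -> float:
    """Fraction of OOD prompts whose completed sequence still satisfies (R1)."""
    prompts = list(prompts)
    for p in prompts:
        if r2(p):
            raise ValueError(f"{p!r} does not violate (R2), so it is not OOD here")
    hits = sum(r1(p + complete(p)) for p in prompts)
    return hits / len(prompts)


def balancing_completer(prompt: str) -> str:
    """Hypothetical stand-in for a trained model: pads with the rarer symbol."""
    diff = prompt.count("a") - prompt.count("b")
    return "b" * diff if diff > 0 else "a" * (-diff)


# Every prompt below contains "ba", so no completion can repair (R2).
ood_prompts = ["ba", "baa", "bbab", "abaa"]
accuracy = extrapolation_accuracy(ood_prompts, balancing_completer)
```

In practice `complete` would wrap sampling from a trained model; the balancing completer only shows what a perfect score on the intact rule looks like.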

Paper Structure (61 sections, 11 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: Rule extrapolation summary for all models and languages (see the languages table): the Transformer performs best on context-free and context-sensitive languages, whereas the LSTM and Mamba excel on regular languages. Gray rectangles mark chance-level performance. Values are mean accuracies and standard deviations over 5 seeds.
  • Figure 2: Graphical model representing our approach for OOD prompt completion. Although Bob's LM $p_{\text{data}}$ assigns zero probability to the OOD prompt, it defines a conditional probability distribution for its completions. Our probabilistic model assumes that Bob's LM completes the ID and OOD prompts independently, according to the same procedure (e.g., the same LM architecture and parameters are used for generating the completions). This amounts to assuming that the Markov factors marked in blue are the same, i.e., $p(\text{completion} \mid \text{prompt}, p_{\text{data}}) = p(\text{completion} \mid \text{OOD prompt}, p_{\text{data}})$, together with the conditional independence $\text{OOD completion} \perp \text{ID prompt} \mid \text{OOD prompt}$.
  • Figure 3: Training dynamics of rule learning for a Transformer trained on the $a^nb^n$ language. Left: the color-coded log probability of all sequences of length $8$ consisting of $a$'s and $b$'s and ending with the EOS token, at initialization (left panel), during training (middle panel), and after training (right panel). The sequences are separated according to which rule they obey. At initialization, the probabilities are distributed roughly evenly; during training, the model starts to assign higher probabilities to sequences satisfying (R2). After training, the most likely sequences are those in (R1) $\cap$ (R2), and the others are negligible. Right: the same trend appears in the normalized sum of the probabilities of the four categories (satisfying (R1) and (R2), only (R1), only (R2), and neither), plotted during training.
  • Figure 4: Training dynamics of the LSTM. For an LSTM trained on the $a^nb^n$ language, we plot during training the normalized probability of all sequences of length $8$ consisting of $a$'s and $b$'s and ending with the EOS token, grouped by which rules they obey into four categories (satisfying (R1) and (R2), only (R1), only (R2), and neither). At initialization, sequences obeying any of the rules have low probability. During training, the model first assigns higher probabilities to sequences satisfying (R2), but soon after, sequences in (R1) $\cap$ (R2) dominate. After training, the most likely sequences are those in (R1) $\cap$ (R2), and the others are negligible.
  • Figure 5: Training dynamics of Mamba. For a Mamba architecture trained on the $a^nb^n$ language, we plot during training the normalized probability of all sequences of length $8$ consisting of $a$'s and $b$'s and ending with the EOS token, grouped by which rules they obey into four categories (satisfying (R1) and (R2), only (R1), only (R2), and neither). Intriguingly, at initialization, sequences obeying (R2) are assigned the largest probability. During training, the model learns (R1) $\cap$ (R2) consistently after 3000 epochs. After training, the most likely sequences are those in (R1) $\cap$ (R2), and the others are negligible.
  • ...and 3 more figures
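
The four-way grouping used in the training-dynamics captions can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes (R1) means equal numbers of a's and b's, (R2) means all a's precede all b's, "length 8" counts the symbols before the EOS token, and normalization divides each category's probability mass by the total over the enumerated sequences. The names `category`, `category_mass`, and `uniform_log_prob` are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Dict


def r1(seq: str) -> bool:
    """Assumed rule (R1): as many a's as b's."""
    return seq.count("a") == seq.count("b")


def r2(seq: str) -> bool:
    """Assumed rule (R2): no 'a' appears after a 'b'."""
    return "ba" not in seq


def category(seq: str) -> str:
    """Name the rule category a sequence falls into."""
    obeys_r1, obeys_r2 = r1(seq), r2(seq)
    if obeys_r1 and obeys_r2:
        return "R1 and R2"
    if obeys_r1:
        return "only R1"
    if obeys_r2:
        return "only R2"
    return "neither"


def category_mass(
    log_prob: Callable[[str], float], length: int = 8
) -> Dict[str, float]:
    """Normalized probability mass per category over all a/b strings of `length`.

    `log_prob(seq)` is expected to score the sequence followed by EOS.
    """
    mass = {"R1 and R2": 0.0, "only R1": 0.0, "only R2": 0.0, "neither": 0.0}
    for symbols in product("ab", repeat=length):
        seq = "".join(symbols)
        mass[category(seq)] += math.exp(log_prob(seq))
    total = sum(mass.values())
    return {name: value / total for name, value in mass.items()}


def uniform_log_prob(seq: str) -> float:
    """Hypothetical untrained scorer: every symbol is equally likely."""
    return -len(seq) * math.log(2)


shares = category_mass(uniform_log_prob)
```

Under the uniform scorer the shares reduce to counts: of the 256 sequences, 1 satisfies both rules, 69 satisfy only (R1), 8 satisfy only (R2), and 178 satisfy neither. Replacing `uniform_log_prob` with a model's scoring function at successive checkpoints would give curves of the kind the captions describe.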

Theorems & Definitions (6)

  • Definition C.1: Prefix code
  • Definition C.2: Prefix Turing Machine
  • Definition C.3: Semimeasure
  • Remark C.1
  • Definition C.4: Lower semicomputability
  • Definition C.5: (Conditional) prefix Kolmogorov complexity