Towards Understanding the Relationship between In-context Learning and Compositional Generalization

Sungjun Han, Sebastian Padó

TL;DR

The paper hypothesizes that forcing models to learn in context provides an inductive bias that promotes compositional generalization. The results indicate that in-context learning problems are useful as an inductive bias for generalization.

Abstract

According to the principle of compositional generalization, the meaning of a complex expression can be understood as a function of the meaning of its parts and of how they are combined. This principle is crucial for human language processing and also, arguably, for NLP models in the face of out-of-distribution data. However, many neural network models, including Transformers, have been shown to struggle with compositional generalization. In this paper, we hypothesize that forcing models to in-context learn can provide an inductive bias to promote compositional generalization. To test this hypothesis, we train a causal Transformer in a setting that renders ordinary learning very difficult: we present it with different orderings of the training instances and shuffle instance labels. This corresponds to training the model on all possible few-shot learning problems attainable from the dataset. The model can solve the task, however, by utilizing earlier examples to generalize to later ones (i.e., in-context learning). In evaluations on the datasets SCAN, COGS, and GeoQuery, models trained in this manner indeed show improved compositional generalization. This indicates the usefulness of in-context learning problems as an inductive bias for generalization.
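
The training setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_episode`, the separator tokens `<sep>` and `<eos>`, and the loss-mask layout are hypothetical. In particular, "shuffling instance labels" is read here as one random permutation of the output tokens applied consistently within an episode, which is an assumption about the method.

```python
import random

SEP, EOS = "<sep>", "<eos>"  # hypothetical separator tokens, not from the paper


def build_episode(dataset, M, shuffle_labels=True, rng=None):
    """Build one few-shot training sequence from (input_tokens, output_tokens) pairs.

    Returns the concatenated token list and a 0/1 mask marking the positions
    (output tokens and their end marker) that would contribute to the loss.
    """
    rng = rng or random.Random()
    # A random ordering of the data, truncated to M examples.
    examples = rng.sample(dataset, k=min(M, len(dataset)))

    if shuffle_labels:
        # ASSUMPTION: label shuffling = a per-episode permutation of the
        # output tokens, so the mapping must be inferred from context.
        vocab = sorted({tok for _, y in examples for tok in y})
        permuted = vocab[:]
        rng.shuffle(permuted)
        mapping = dict(zip(vocab, permuted))
        examples = [(x, [mapping[t] for t in y]) for x, y in examples]

    tokens, loss_mask = [], []
    for x, y in examples:
        tokens += x + [SEP] + y + [EOS]
        # Predict only the outputs: mask the input and the separator.
        loss_mask += [0] * (len(x) + 1) + [1] * (len(y) + 1)
    return tokens, loss_mask


if __name__ == "__main__":
    toy = [(["jump"], ["JUMP"]), (["walk", "twice"], ["WALK", "WALK"]), (["run"], ["RUN"])]
    toks, mask = build_episode(toy, M=2, rng=random.Random(0))
    print(toks)
    print(mask)
```

Because the permutation changes from episode to episode, memorizing a fixed input-output mapping does not help in this sketch; the earlier examples in the sequence are the only reliable source of the current mapping.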

Paper Structure (45 sections, 1 equation, 4 figures, 4 tables)

This paper contains 45 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of our meta-in-context learning framework. (Left) We build our meta-task distribution by sampling random linear orderings from a sequence-to-sequence dataset and concatenating the input-output mappings (i.e., $(x_i, y_i)$). We optionally shuffle the labels to eliminate memorization and keep only $M$ examples. A causal Transformer ($t_\theta$) is trained with these concatenated results for next-token prediction, only predicting for the outputs. $\phi$ refers to the pad-token. (Right) At inference, we freeze the weights and randomly sample $k < M$ train examples to use as a context in predicting the test query $x_{\text{query}}$.
  • Figure 2: Exp. 2: Models trained on different lengths of trajectories (i.e., $M$), with $k=M-1$. $M=1$ is equivalent to the causal Transformer baseline. Dotted lines: models without label shuffling.
  • Figure 3: Exp. 3: Models evaluated on different numbers of support examples $k$. Lines differ in $M$ (max. roll-out length of meta-training trajectories). Dotted lines: models without label shuffling.
  • Figure 4: Exp. 4: Models evaluated with different numbers of support examples $k$ sampled from the held-out portion of the test set, both with label shuffling (LB; left column) and without (right column). Relative improvement (RI) is calculated as $\frac{\text{new}-\text{old}}{\text{old}}\times 100$.
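
The evaluation side of the captions above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `build_prompt`, the `<sep>` and `<eos>` tokens, and the prompt layout are hypothetical, while `relative_improvement` follows the RI formula given in the Figure 4 caption.

```python
import random


def relative_improvement(new, old):
    """Relative improvement in percent: (new - old) / old * 100."""
    if old == 0:
        raise ValueError("relative improvement is undefined for old == 0")
    return (new - old) / old * 100


def build_prompt(support_pool, query, k, rng=None):
    """Sample k support examples and append the query input.

    ASSUMPTION: the '<sep>'/'<eos>' layout is illustrative only; the captions
    do not specify how the context is serialized.
    """
    rng = rng or random.Random()
    support = rng.sample(support_pool, k=k)
    tokens = []
    for x, y in support:
        tokens += x + ["<sep>"] + y + ["<eos>"]
    # The frozen model would continue generating from here.
    return tokens + query + ["<sep>"]


if __name__ == "__main__":
    pool = [(["jump"], ["JUMP"]), (["run"], ["RUN"]), (["walk"], ["WALK"])]
    print(build_prompt(pool, ["jump", "twice"], k=2, rng=random.Random(0)))
    print(relative_improvement(75.0, 50.0))  # 50.0
```

With an accuracy of 50.0 for the old model and 75.0 for the new one, the formula yields a relative improvement of 50 percent.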