Limits of Transformer Language Models on Learning to Compose Algorithms

Jonathan Thomm, Giacomo Camposampiero, Aleksandar Terzic, Michael Hersche, Bernhard Schölkopf, Abbas Rahimi

TL;DR

The results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient: LLaMA requires more data samples than relearning all sub-tasks from scratch to learn the compositional task; in-context prompting with few samples is unreliable and fails at executing the sub-tasks or correcting the errors in multi-round code generation.

Abstract

We analyze the capabilities of Transformer language models in learning compositional discrete tasks. To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks. In particular, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient: LLaMA requires more data samples than relearning all sub-tasks from scratch to learn the compositional task; in-context prompting with few samples is unreliable and fails at executing the sub-tasks or correcting the errors in multi-round code generation. Further, by leveraging complexity theory, we support these findings with a theoretical analysis focused on the sample inefficiency of gradient descent in memorizing feedforward models. We open source our code at https://github.com/IBM/limitations-lm-algorithmic-compositional-learning.

Paper Structure (39 sections, 6 theorems, 3 equations, 6 figures, 24 tables)

This paper contains 39 sections, 6 theorems, 3 equations, 6 figures, 24 tables.

Key Result

Theorem 4.1

There exist (many) series of train-test datasets such that feedforward models trained with gradient descent that memorize samples need $O(n^k)$ times more data than an optimal learner; otherwise, they will not generalize to the test set. Here, $k$ is any positive number and $n$ is the index in the series.

Figures (6)

  • Figure 1: Translation of a compositional algorithmic task $A$, PEN (see Section \ref{sec:tasks-intro} for details), into its corresponding compositional graph $G_{A(x)}$, for the input $x=\text{"ab xy ab4fq wv7ql"}$. The operations (edges) are color-matched with the respective operations in the pseudo-code of the algorithmic task $A(x)$.
  • Figure 2: Introduced compositional algorithmic tasks. Left: Pointer Execution's Neighbor (PEN), together with the Pointer Execution (PE) and Pointer Execution Verbose (PEV) sub-tasks. Starting from the left, the output is obtained by matching words and predicting the current word (in PE) or its neighbor (in PEV and PEN). The matching criterion is that the last two characters of the current word equal the first two characters of the matched word. Because the input string is constructed without ambiguities, an attention mechanism can find the match by retrieving the last two characters of the current word and matching them with the (unique) word that starts with them. Right: Pointer Execution Reverse Multicount (PERM), together with the Pointer Execution (PE) and Pointer Execution Reverse (PER) sub-tasks. PERM first outputs the last word in the matching sequence and then goes backward. The number in the answer for each word is the count of matches times the count of left matches (i.e., arrows pointing to the left in the forward matching sequence).
  • Figure 3: Accuracy of LLaMA models on PEN, PERM, HSS, and MUL, and their respective sub-tasks. While LLaMA achieves perfect accuracy on all individual sub-tasks, it needs much larger amounts of training data to learn their composition. This observation rules out hypotheses $\mathcal{H}_2$ (learning the composition requires fewer samples than the hardest sub-task, green) and $\mathcal{H}_3$ (learning the composition requires fewer samples than the sum of the sub-tasks, yellow) on every task; LLaMA always seems to fall into $\mathcal{H}_4$ (learning the composition requires more samples than the sum of the sub-tasks, red). Moreover, training together with the sub-tasks does not seem to perform noticeably better than training only on the main task (i.e., w/o sub-tasks).
  • Figure A.4: Intuition for our statement: based on widely accepted assumptions, there are problems that require much more computation time to learn than to use (here: per available sample). Gradient descent on feedforward networks, under the assumption of constant-step memorization, will however only be able to learn problems that are as hard to learn as they are to execute during inference. Therefore, the model size needs to be much larger than necessary, or the number of training samples needs to be much larger than necessary, or gradient descent will have trouble learning.
  • Figure B.5: Matchings of the yellow words: The (yellow) neighbors of matched green words build a matching sequence themselves (in their own order) and have exactly one matching outside the neighbors each (except the last yellow word in the order of the yellow sequence, which has no matching). Therefore, each yellow neighbor except the last matches to two words in the sequence. The blue arrows are over the remaining green positions (which are not part of any answer), which also build a matching sequence. Those additional constraints remove shortcut solutions.
  • ...and 1 more figure
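
The matching rule described in the Figure 2 caption can be illustrated with a minimal Python sketch of the Pointer Execution (PE) sub-task. This is not the authors' implementation: the function name `pointer_execution`, the choice to start at the first word, the stopping rule (stop when no unvisited word matches), and the example input are all assumptions made for illustration only.

```python
def pointer_execution(text: str) -> list[str]:
    """Follow the PE matching chain (illustrative sketch, not the paper's code).

    Assumptions: the chain starts at the first word, and it ends when no
    unvisited word starts with the last two characters of the current word.
    """
    words = text.split()
    if not words:
        return []

    visited = [0]  # indices of words in the order they are reached
    current = 0
    while True:
        key = words[current][-2:]  # last two characters of the current word
        # Candidate matches: unvisited words whose first two characters equal the key.
        matches = [
            j for j, w in enumerate(words)
            if j not in visited and w.startswith(key)
        ]
        if not matches:
            break  # end of the matching sequence
        if len(matches) > 1:
            # The caption states that inputs are built without ambiguities.
            raise ValueError(f"ambiguous match for key {key!r}")
        current = matches[0]
        visited.append(current)

    return [words[i] for i in visited]


# Hypothetical input, constructed so that every match is unique.
print(pointer_execution("ab fq1zz ab4fq zz9kk"))
```

On this hypothetical input the chain visits `ab`, then `ab4fq` (starts with `ab`), then `fq1zz` (starts with `fq`), then `zz9kk` (starts with `zz`), and stops because no word starts with `kk`. PEV and PEN would additionally involve each matched word's neighbor, which this sketch does not model.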

Theorems & Definitions (15)

  • Theorem 4.1: Informal Theorem \ref{corollary:many-samples-needed}
  • Definition A.1: $\mathcal{K}$
  • Definition A.2: Concept, Concept Class
  • Definition A.3: Application of a Concept
  • Definition A.4: Learning Algorithm
  • Theorem A.5
  • Definition A.6: Constant Memorization
  • Corollary A.7
  • Theorem
  • proof
  • ...and 5 more