Compositional Reasoning with Transformers, RNNs, and Chain of Thought

Gilad Yehudai, Noah Amsel, Joan Bruna

TL;DR

The paper analyzes the expressive power of transformers, RNNs, and transformers augmented with chain-of-thought tokens on Compositional Reasoning Questions (CRQs). It shows that depth-$L$ transformers can solve CRQs with up to $n$ nodes and depth $L$ using bit-precision $O(\log n)$, while CRQs are $\mathsf{NC}^1$-hard via a reduction from the Boolean formula evaluation problem (BFEP), so constant-depth transformers cannot solve them under standard circuit-complexity assumptions. It further proves that shallow RNNs can solve CRQs with hidden dimension $O(d\log n)$ when the inputs arrive in a favorable order, and that every ordering requires at least $\Omega(\log n)$ hidden-state capacity. Finally, it demonstrates that constant-depth transformers can solve CRQs when given $n$ chain-of-thought tokens. Altogether, the results reveal complementary strengths and trade-offs across architectures for compositional reasoning tasks.

Abstract

We study and compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of problems we term Compositional Reasoning Questions (CRQ). This family captures problems like evaluating Boolean formulas and multi-step word problems. Assuming standard hardness assumptions from circuit complexity and communication complexity, we prove that none of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We also provide a construction for each architecture that solves CRQs. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. (Otherwise, a linear dimension is necessary). For transformers with chain of thought, our construction uses $n$ CoT tokens. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.

Paper Structure

This paper contains 10 sections, 9 theorems, 21 equations, 1 figure, 1 algorithm.

Key Result

Theorem 2.1

For any $L,n\in \mathbb{N}$ there exists a depth-$L$ transformer $T$ that solves all the compositional reasoning questions up to $n$ nodes and depth $L$. The bit-precision of the transformer is $O(\log(n))$.
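To build intuition for why depth matching the tree depth can suffice, the following is a minimal sketch that evaluates a Boolean formula tree in synchronous rounds, resolving in each round every node whose children are already known. It is an illustration only, not the paper's transformer construction: the node encoding (`"const"`, `"and"`, `"or"`, `"not"`), the function name `evaluate_in_rounds`, and the use of Boolean formulas as a stand-in for general CRQs are all assumptions made here.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical encoding (not from the paper): node id -> (op, payload).
# Leaves are ("const", bool); internal nodes are (op, [child ids]).
Node = Tuple[str, Union[bool, List[int]]]


def evaluate_in_rounds(nodes: Dict[int, Node], root: int) -> Tuple[bool, int]:
    """Evaluate a formula tree bottom-up; return (root value, rounds used)."""
    # Round 0: only the leaf values are known.
    values: Dict[int, bool] = {
        i: bool(payload) for i, (op, payload) in nodes.items() if op == "const"
    }
    rounds = 0
    while root not in values:
        newly: Dict[int, bool] = {}
        for i, (op, children) in nodes.items():
            if i in values:
                continue
            # A node is resolved only from values known before this round,
            # mimicking one parallel step applied to all nodes at once.
            if all(c in values for c in children):
                args = [values[c] for c in children]
                if op == "and":
                    newly[i] = all(args)
                elif op == "or":
                    newly[i] = any(args)
                elif op == "not":
                    newly[i] = not args[0]
                else:
                    raise ValueError(f"unknown op: {op}")
        if not newly:
            raise ValueError("no progress: malformed tree")
        values.update(newly)
        rounds += 1
    return values[root], rounds


# (x1 AND NOT x2) OR x3 with x1=True, x2=False, x3=False
formula: Dict[int, Node] = {
    0: ("or", [1, 4]),
    1: ("and", [2, 3]),
    2: ("const", True),
    3: ("not", [5]),
    4: ("const", False),
    5: ("const", False),
}
result, rounds = evaluate_in_rounds(formula, root=0)
print(result, rounds)
```

In this sketch the number of rounds equals the height of the tree (three for the example formula), which loosely mirrors the role of the depth parameter $L$ in the theorem; how the paper realizes each round with attention layers is not reproduced here.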

Figures (1)

  • Figure 1: Left: Construction for $\neg \varphi$. Right: Construction for $\varphi_1 \land \varphi_2$ and $\varphi_1 \lor \varphi_2$

Theorems & Definitions (20)

  • Theorem 2.1
  • Lemma 2.2
  • proof
  • proof: Proof of Theorem 2.1
  • Theorem 3.1
  • proof
  • Theorem 4.1
  • proof
  • Claim 4.2: Lower bound for set disjointness (Yao, 1979)
  • Definition 4.3: Memory Rank
  • ...and 10 more