Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

TL;DR

This work analyzes depth-width tradeoffs for graph tasks solved by transformers, showing that allowing linear embedding width can yield constant-depth solutions for problems like connectivity and 1 vs 2 cycles, and that further reductions are possible with Chain-of-Thought tokens. It provides concrete upper bounds (e.g., $O(1)$ layers with $m=O(n)$ for 1 vs 2 cycles and $O(1)$ depth with $m=O(n\log^3 n)$ for connectivity) and presents lower-bound limitations in subquadratic memory regimes (e.g., Eulerian-cycle verification requiring $m=\Omega(n^2)$ or $L=\Omega(\log n)$ under a conjecture). The paper also demonstrates that linear memory enables computations such as $A^{L+1}$ in $O(L)$ layers and discusses subgraph-detection strategies with $O(1)$ depth given larger memory, alongside lower-bound arguments from communication complexity. Empirically, adjacency-row embeddings outperform traditional edge-list and Laplacian-based representations on several datasets, supporting the practicality of the proposed approach and underscoring the value of memory-width tuning for efficient graph reasoning with transformers.
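The contrast between edge-list and adjacency-row inputs mentioned above can be made concrete with a small sketch. It assumes an undirected graph on nodes `0..n-1`; the function names `edge_list_tokens` and `adjacency_row_tokens` are hypothetical and chosen for illustration only. This is not the paper's tokenizer, just one plausible reading of the two representations: one narrow token per edge, versus one width-$n$ token per node.

```python
def edge_list_tokens(edges):
    """One token per undirected edge: a normalized pair of node ids.

    Each token is narrow (two ids), but there can be up to O(n^2) tokens.
    """
    return [(min(u, v), max(u, v)) for u, v in edges]


def adjacency_row_tokens(n, edges):
    """One token per node: that node's row of the adjacency matrix.

    There are exactly n tokens, each of width n (linear in graph size).
    """
    rows = [[0] * n for _ in range(n)]
    for u, v in edges:
        rows[u][v] = 1
        rows[v][u] = 1  # undirected graph: the matrix is symmetric
    return rows


# Illustrative input: a 4-cycle 0-1-2-3-0.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
edge_tokens = edge_list_tokens(square)
row_tokens = adjacency_row_tokens(4, square)
```

In this sketch the adjacency-row form spends width (each token has $n$ entries) to keep the sequence length at $n$, which is the regime the linear-width results refer to.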

Abstract

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement a task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We support our theoretical results with empirical evaluations.

Paper Structure

This paper contains 22 sections, 6 theorems, 5 equations, 1 table.

Key Result

Theorem 4.1

There exists a transformer with $2$ layers of self-attention, and embedding dimension $O(n)$ that solves the $1$ vs. $2$ cycle problem.
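To make the task in Theorem 4.1 concrete, the following is a minimal classical reference for the $1$ vs. $2$ cycle problem. It is not the transformer construction from the theorem. It assumes the input is given as adjacency rows and satisfies the usual promise that every node has degree exactly $2$, so the graph is a disjoint union of cycles; `rows_from_cycles` and `is_single_cycle` are hypothetical helper names introduced here.

```python
def rows_from_cycles(cycles):
    """Build adjacency rows for a disjoint union of cycles.

    `cycles` is a list of node-id lists (each of length >= 3) that together
    cover the ids 0..n-1 exactly once. Helper for illustration only.
    """
    n = sum(len(c) for c in cycles)
    rows = [[0] * n for _ in range(n)]
    for cyc in cycles:
        for i, u in enumerate(cyc):
            v = cyc[(i + 1) % len(cyc)]  # next node around the cycle
            rows[u][v] = 1
            rows[v][u] = 1
    return rows


def is_single_cycle(adj_rows):
    """Return True if the degree-2 graph is one cycle, False otherwise.

    Under the degree-2 promise, the graph is a single cycle exactly when
    it is connected, so a traversal from node 0 suffices.
    """
    n = len(adj_rows)
    if n == 0 or any(sum(row) != 2 for row in adj_rows):
        raise ValueError("expected a non-empty graph with all degrees equal to 2")
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v, bit in enumerate(adj_rows[u]):
            if bit and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


one_cycle = rows_from_cycles([[0, 1, 2, 3, 4, 5]])
two_cycles = rows_from_cycles([[0, 1, 2], [3, 4, 5]])
```

Here `is_single_cycle(one_cycle)` distinguishes the 6-cycle from `two_cycles` (two triangles) by reachability; the traversal is sequential, whereas the theorem concerns what a 2-layer transformer with $O(n)$ embedding dimension can compute.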

Theorems & Definitions (13)

  • Remark 3.1
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Theorem 5.1
  • proof
  • Lemma 5.2
  • proof
  • Theorem 5.3
  • ...and 3 more