Simulation of Graph Algorithms with Looped Transformers

Artur Back de Luca; Kimon Fountoulakis

TL;DR

This work studies, from a theoretical perspective, the ability of transformer networks to simulate algorithms on graphs. It proves by construction that a looped transformer with extra attention heads can simulate individual algorithms such as Dijkstra's shortest path, Breadth- and Depth-First Search, and Kosaraju's strongly connected components, as well as multiple algorithms simultaneously.

Abstract

The execution of graph algorithms using neural networks has recently attracted significant interest due to promising empirical progress. This motivates further understanding of how neural networks can replicate reasoning steps with relational data. In this work, we study the ability of transformer networks to simulate algorithms on graphs from a theoretical perspective. The architecture we use is a looped transformer with extra attention heads that interact with the graph. We prove by construction that this architecture can simulate individual algorithms such as Dijkstra's shortest path, Breadth- and Depth-First Search, and Kosaraju's strongly connected components, as well as multiple algorithms simultaneously. The number of parameters in the networks does not increase with the input graph size, which implies that the networks can simulate the above algorithms for any graph. Despite this property, we show a limit to simulation in our solution due to finite precision. Finally, we show a Turing Completeness result with constant width when the extra attention heads are utilized.
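The looping idea in the abstract can be pictured with a small sketch: a single fixed update is applied repeatedly to a state while the graph stays constant, until a termination flag is raised. The code below is ordinary Python, not the transformer construction from the paper; the names `bfs_step` and `run_looped` are hypothetical and chosen only for illustration. It shows that the update rule itself has a fixed description regardless of how many nodes the graph has.

```python
def bfs_step(adj, state):
    """Apply one fixed update; the adjacency matrix `adj` is read but never modified."""
    visited, frontier, dist, depth = state
    next_frontier = [False] * len(adj)
    for u, active in enumerate(frontier):
        if not active:
            continue
        for v, edge in enumerate(adj[u]):
            if edge and not visited[v]:
                visited[v] = True
                dist[v] = depth + 1
                next_frontier[v] = True
    done = not any(next_frontier)  # termination flag: nothing left to expand
    return (visited, next_frontier, dist, depth + 1), done


def run_looped(adj, source):
    """Repeat the same step until the termination flag is set."""
    n = len(adj)
    visited = [i == source for i in range(n)]
    frontier = list(visited)
    dist = [0 if i == source else None for i in range(n)]
    state, done = (visited, frontier, dist, 0), False
    while not done:
        state, done = bfs_step(adj, state)
    return state[2]  # decode: keep only the distance variable


# Small directed example graph (hypothetical data).
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(run_looped(adj, 0))
```

In this sketch `adj` plays the role of the constant graph input and `state` the role of the variables that are rewritten on every pass; the final line of `run_looped` corresponds to extracting the desired output once the loop stops.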

Paper Structure (73 sections, 5 theorems, 52 equations, 3 figures, 1 table, 10 algorithms)

Introduction
Related work
Preliminaries
The Architecture
Discussion on the attention head in \ref{eq:attention_head}
Simulation Details
Input matrix
Less-than
Read row from $A$
Theoretical analysis
Positional encodings and increment
Maximum absolute value parameter
Results on simulation of graph algorithms
Turing Completeness
Training Limitations in Algorithm Simulation
...and 58 more sections

Key Result

Theorem 6.1

There exists a looped-transformer $h_T$ in the form of \ref{eq:layer}, with 17 layers, 3 attention heads, and layer width $O(1)$ that simulates Dijkstra's shortest path algorithm for weighted graphs with rational edge-weights, up to $O(\hat{\delta}^{-1})$ nodes and graph diameter of $O(\Omega\varepsilon)$.
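For reference, the algorithm being simulated in this theorem can be sketched in plain Python. This is the textbook array-based form of Dijkstra's algorithm, not the 17-layer construction; the function name `dijkstra`, the matrix layout, and the example weights are assumptions made for illustration. Edge weights use `fractions.Fraction` to mirror the rational-weight setting of the statement.

```python
from fractions import Fraction


def dijkstra(weights, source):
    """weights[u][v] is a non-negative Fraction edge weight, or None if there is no edge."""
    n = len(weights)
    dist = [None] * n  # None stands for "not reached yet"
    dist[source] = Fraction(0)
    visited = [False] * n
    while True:
        # Select the unvisited node with the smallest known distance.
        u = None
        for i in range(n):
            if visited[i] or dist[i] is None:
                continue
            if u is None or dist[i] < dist[u]:
                u = i
        if u is None:  # termination: no reachable unvisited node remains
            break
        visited[u] = True
        # Relax every outgoing edge of u.
        for v in range(n):
            w = weights[u][v]
            if w is None or visited[v]:
                continue
            candidate = dist[u] + w
            if dist[v] is None or candidate < dist[v]:
                dist[v] = candidate
    return dist


# Hypothetical 4-node weighted digraph.
W = [
    [None, Fraction(1, 2), Fraction(2), None],
    [None, None, Fraction(1, 3), None],
    [None, None, None, Fraction(1)],
    [None, None, None, None],
]
print(dijkstra(W, 0))
```

The two inner operations, a minimum selection and a comparison-guarded update, are the kinds of steps that rely on less-than and conditional primitives, which is where the precision parameters in the bound enter.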

Figures (3)

Figure 1: A simplified illustration of the simulation of Dijkstra's algorithm using a looped transformer with extra attention heads that interact with the graph. On the left, we display the pseudocode of Dijkstra's algorithm, serving as the source code. The rightmost section shows the corresponding simulation via a transformer, where each step of the source code in the loop is simulated using one or more transformer blocks. We specifically focus on lines 11 and 12, highlighted to demonstrate the simulation of individual functions, as discussed in \ref{sec:simulation}. At the top center of the figure, the encoding of graph information and variable scopes into $\Tilde{A}$ and $X$ is depicted. For clarity, $X$ is shown in its transposed format. Throughout the transformer loop, $\Tilde{A}$ remains constant, while $X$ is updated in each iteration until the simulation meets its termination criteria. Upon termination, the decoding step extracts columns from $X$ that correspond to the algorithm's desired output.
Figure 2: Illustration of the input structure used in the simulation of graph algorithms. On the left, columns indicate node positions using circular positional encodings \cite{liu2022transformers}, detailed in \ref{sec:theory}. The next blocks on the right denote the global (in red) and local (in green) variables, which occupy the top and last $n$ rows of $X$, respectively. In these blocks, the first column marks the bias of the corresponding variables. The symbol $t$ indicates the termination flag, while the symbols $z$ and $x_i, y_i$ are generic local and global variables, respectively. Finally, depicted in the far right block, the scratchpad is used for temporary storage and calculation. Non-shaded areas remain null during execution.
Figure 3: Condition number of linear layers in the constructions of the if-else (left) and less-than (right) functions. The x-axes indicate the value of the construction parameters $\Omega$ and $\varepsilon$ of \ref{eq:if_else} and \ref{eq:less_than}, respectively. A higher $\Omega$ improves the simulation quality for the if-else function, while a smaller $\varepsilon$ improves the simulation quality for the less-than function.
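The roles of $\Omega$ and $\varepsilon$ in Figure 3 can be illustrated with a common ReLU-based way of building these two primitives. The formulas below are an assumption made for illustration and are not claimed to match the paper's if-else and less-than equations term by term; the function names `less_than` and `if_else` are likewise hypothetical. The sketch assumes non-negative inputs and a condition flag in {0, 1}.

```python
def relu(x):
    return max(0.0, x)


def less_than(a, b, eps):
    """Approximate the indicator of a < b.

    Exact (0.0 or 1.0) when a >= b or when b - a >= eps;
    returns an intermediate value when 0 < b - a < eps.
    """
    gap = b - a
    return (relu(gap) - relu(gap - eps)) / eps


def if_else(cond, a, b, omega):
    """Return a if cond == 1 else b, assuming 0 <= a, b <= omega."""
    return relu(a - omega * (1 - cond)) + relu(b - omega * cond)


# Well-separated inputs: the comparison is exact.
print(less_than(1.0, 2.0, eps=0.25))
# Inputs closer than eps: the output is neither 0 nor 1.
print(less_than(1.0, 1.125, eps=0.25))
# omega large enough to suppress the unselected branch.
print(if_else(1, 3.0, 5.0, omega=100.0))
# omega too small: the unselected branch leaks into the result.
print(if_else(1, 3.0, 5.0, omega=4.0))
```

Under these assumptions, a smaller `eps` lets `less_than` separate closer values and a larger `omega` lets `if_else` handle larger operands, at the cost of larger weights in the corresponding linear layers, which is the trade-off the condition-number plots describe.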

Theorems & Definitions (11)

Definition 3.1: Simulation
Theorem 6.1
Theorem 6.2
Theorem 6.3
Theorem 6.4
Remark 6.5
Remark 6.6
Lemma C.1
proof
Remark C.2
...and 1 more
