How Powerful are Decoder-Only Transformer Neural Models?

Jesse Roberts

TL;DR

This paper proves that the general transformer neural model undergirding modern large language models (LLMs) is Turing complete under reasonable assumptions, and shows that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold.

Abstract

In this article, we prove that the general transformer neural model undergirding modern large language models (LLMs) is Turing complete under reasonable assumptions. This is the first work to directly address the Turing completeness of the underlying technology employed in GPT-x, as past work has focused on the more expressive, full auto-encoder transformer architecture. From this theoretical analysis, we show that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold. We also show that Transformers are a variant of B machines studied by Hao Wang.

Paper Structure (29 sections, 4 theorems, 2 equations, 2 figures)

Introduction
Background
Disambiguating Decoder-Only Transformer Models
Modifying the Vanilla Transformer to form a Decoder-only Model
Differentiating Encoder-only and Decoder-only Models
Related Theoretical Work on Transformers
Required Conventions Inherited from Vanilla Transformers
Definitions & Approach
Embedding & Position
Decoder-only Transformer Architecture
Self-Attention
Feed Forward Network
Single Layer Decoder-Only Models
Multi-Layer Decoder-Only Models
Proof Approach
...and 14 more sections

Key Result

Theorem 4.1

For any pair of fully connected feed forward neural networks (FFNs) in which the outputs of the first are fed into the inputs of the second, there exists a single FFN whose outputs are identical to the outputs of the second network.
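
The intuition behind Theorem 4.1 can be illustrated numerically. The following is a minimal sketch, not the paper's proof: it assumes an FFN can be represented as an ordered list of (weight, bias, activation) layers, and that stacking the layers of the second network after those of the first yields the single replacement network. The names `run_ffn`, `compose`, `ffn_a`, and `ffn_b`, the layer sizes, and the choice of ReLU are all hypothetical and chosen only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def identity(z):
    return z

def run_ffn(layers, x):
    """Apply a list of (W, b, activation) layers to the vector x, in order."""
    for W, b, act in layers:
        x = act(W @ x + b)
    return x

def compose(first, second):
    """Build one layer list from two cascaded networks (illustrative assumption)."""
    return list(first) + list(second)

rng = np.random.default_rng(0)

# Hypothetical first network: 3 inputs -> 5 hidden units -> 4 outputs.
ffn_a = [
    (rng.normal(size=(5, 3)), rng.normal(size=5), relu),
    (rng.normal(size=(4, 5)), rng.normal(size=4), identity),
]
# Hypothetical second network: 4 inputs -> 6 hidden units -> 2 outputs.
ffn_b = [
    (rng.normal(size=(6, 4)), rng.normal(size=6), relu),
    (rng.normal(size=(2, 6)), rng.normal(size=2), identity),
]

x = rng.normal(size=3)
cascaded = run_ffn(ffn_b, run_ffn(ffn_a, x))  # first network feeds the second
single = compose(ffn_a, ffn_b)                # one network with all four layers
merged = run_ffn(single, x)
```

Under these assumptions, `merged` and `cascaded` agree for any input, because the single network performs exactly the same sequence of affine maps and activations as the cascade.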

Figures (2)

Figure 1: Vanilla Transformer Architecture. The yellow dashed line surrounds the sections removed to create a Decoder-only Transformer model.
Figure 2: Decoder-only (left) and Encoder-only (right) Transformer Architectures. Green boxes are sequences of vectors with the width of the box representing relative sequence length. Red denotes a single vector. Gray and blue boxes denote simple and compound operations respectively.

Theorems & Definitions (4)

Theorem 4.1: Single Network replacement of Cascaded Networks
Theorem 4.2: FFN Override Input
Theorem 4.3: Recognize the stop token
Theorem 4.4: Compression of $\mathbf{x}_t$ and $\mathbf{h}_t$ into $\mathbf{r}_t$
