
How Powerful are Decoder-Only Transformer Neural Models?

Jesse Roberts

TL;DR

We prove that the general transformer neural model undergirding modern large language models (LLMs) is Turing complete under reasonable assumptions, and that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold.

Abstract

In this article we prove that the general transformer neural model undergirding modern large language models (LLMs) is Turing complete under reasonable assumptions. This is the first work to directly address the Turing completeness of the underlying technology employed in GPT-x, as past work has focused on the more expressive, full auto-encoder transformer architecture. From this theoretical analysis, we show that the sparsity/compressibility of the word embedding is an important consideration for Turing completeness to hold. We also show that Transformers are a variant of B machines studied by Hao Wang.

Paper Structure (29 sections, 4 theorems, 2 equations, 2 figures)


Key Result

Theorem 4.1

For any pair of fully connected feed forward neural networks (FFNs) such that the outputs of the first are fed into the inputs of the next, there exists a single FFN whose outputs will be identical to the outputs of the second network.
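One informal way to see why a statement like Theorem 4.1 is plausible is that stacking the layers of the first network under the layers of the second already yields a single layered network. The sketch below illustrates that idea only; it is not taken from the paper's proof. The layer representation `(weights, bias, use_relu)`, the helper names `run_ffn` and `cascade`, and the choice of ReLU are all assumptions made for illustration, and the output width of the first network is assumed to match the input width of the second.

```python
from typing import List, Tuple

# Hypothetical layer representation: (weight rows, bias vector, apply ReLU?).
Layer = Tuple[List[List[float]], List[float], bool]


def run_ffn(layers: List[Layer], x: List[float]) -> List[float]:
    """Evaluate a fully connected feed forward network on input x."""
    for weights, bias, use_relu in layers:
        # Affine map: one output per weight row.
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if use_relu:
            x = [max(0.0, v) for v in x]
    return x


def cascade(first: List[Layer], second: List[Layer]) -> List[Layer]:
    """Build one FFN by placing the layers of `second` after those of `first`.

    Assumes the output width of `first` equals the input width of `second`.
    """
    return list(first) + list(second)


# Two small example networks (values chosen arbitrarily for illustration).
ffn_a: List[Layer] = [([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0], True)]
ffn_b: List[Layer] = [([[2.0, 1.0]], [0.5], False)]

x = [3.0, 1.0]
two_step = run_ffn(ffn_b, run_ffn(ffn_a, x))    # feed A's output into B
one_step = run_ffn(cascade(ffn_a, ffn_b), x)    # single combined network
print(two_step, one_step)
```

Under these assumptions both evaluations agree, which matches the claim that a single FFN can reproduce the output of the second network in the cascade.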

Figures (2)

  • Figure 1: Vanilla Transformer Architecture. The yellow dashed line surrounds the sections removed to create a Decoder-only Transformer model.
  • Figure 2: Decoder-only (left) and Encoder-only (right) Transformer Architectures. Green boxes are sequences of vectors with the width of the box representing relative sequence length. Red denotes a single vector. Gray and blue boxes denote simple and compound operations respectively.

Theorems & Definitions (4)

  • Theorem 4.1: Single Network replacement of Cascaded Networks
  • Theorem 4.2: FFN Override Input
  • Theorem 4.3: Recognize the stop token
  • Theorem 4.4: Compression of $\mathbf{x}_t$ and $\mathbf{h}_t$ into $\mathbf{r}_t$