Transformers are Efficient Compilers, Provably

Xiyu Zhai, Runlong Zhou, Liao Zhang, Simon Shaolei Du

TL;DR

The paper examines whether transformers can function as efficient compilers by formalizing compiler tasks for a C-like language, Mini-Husky. It develops Cybertron, a typed DSL to express and prove that transformers can perform AST construction, symbol resolution, and type analysis with parameter counts that scale logarithmically with input length, under bounded-depth assumptions. It also proves an exponential separation from RNNs in type-analysis tasks and provides empirical validation on synthetic data. The results indicate a principled, provable role for transformers as compilers, with potential impact on program analysis, verification, and code-processing pipelines.

Abstract

Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generation. In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective. To this end, we introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages. We show that if the input code sequence has a bounded depth in both the Abstract Syntax Tree (AST) and type inference (reasonable assumptions based on the clean code principle), then the number of parameters required by transformers depends only on the logarithm of the input sequence length to handle compilation tasks, such as AST construction, symbol resolution, and type analysis. A significant technical challenge stems from the fact that transformers operate at a low level, where each layer processes the input sequence as raw vectors without explicitly associating them with predefined structure or meaning. In contrast, high-level compiler tasks necessitate managing intricate relationships and structured program information. Our primary technical contribution is the development of a domain-specific language, Cybertron, which generates formal proofs of the transformer's expressive power, scaling to address compiler tasks. We further establish that recurrent neural networks (RNNs) require at least a linear number of parameters relative to the input sequence, leading to an exponential separation between transformers and RNNs. Finally, we empirically validate our theoretical results by comparing transformers and RNNs on compiler tasks within Mini-Husky.
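The separation claimed in the abstract can be written out as a short worked comparison. This is only a sketch: the symbols $N_{\mathrm{TF}}$, $N_{\mathrm{RNN}}$, $g$, and $k$ are notation assumed here for illustration and do not appear in the abstract, and the polynomial form of $g$ is an assumption rather than a statement taken from the paper.

```latex
% Assumed notation: N_TF(L) and N_RNN(L) are the parameter counts needed
% by a transformer and by an RNN on inputs of length L.
\begin{align*}
  N_{\mathrm{TF}}(L)  &= g(\log L)
      && \text{(depends on $L$ only through $\log L$)} \\
  N_{\mathrm{RNN}}(L) &= \Omega(L) = \Omega\!\left(2^{\log_2 L}\right)
      && \text{(at least linear in $L$)}
\end{align*}
% Writing k = \log_2 L and assuming g is a polynomial in k:
\[
  \frac{N_{\mathrm{RNN}}(L)}{N_{\mathrm{TF}}(L)}
  = \Omega\!\left(\frac{2^{k}}{g(k)}\right)
  \xrightarrow[k \to \infty]{} \infty .
\]
```

Under these assumptions the RNN lower bound is exponential in $k = \log_2 L$, the same quantity that governs the transformer's size, which is one way to read the phrase "exponential separation".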

Paper Structure

This paper contains 68 sections, 20 theorems, 34 equations, 7 figures, 2 tables.

Key Result

Theorem 1

There exists a transformer encoder whose model dimension and number of layers are both $O(\log L+D)$ and whose number of heads is $O(1)$, and which represents a function mapping any token sequence of length $L$ in $\operatorname{MiniHusky}_{D}$ to its abstract syntax tree, represented as a sequence.
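To make the two quantities in Theorem 1 concrete, the following minimal sketch computes a sequence length $L$ and a nesting depth $D$ for a toy token sequence, then evaluates $\log_2 L + D$. It rests on assumptions: the Mini-Husky grammar is not given in this summary, so delimiter nesting is used as a stand-in for AST depth, and the names `OPENERS`, `nesting_depth`, and `size_scale` are hypothetical helpers rather than anything from the paper.

```python
import math

# Hypothetical delimiter pairs; the actual Mini-Husky grammar is not specified here.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())


def nesting_depth(tokens):
    """Return the maximum delimiter nesting depth, used as a proxy for AST depth D."""
    stack = []  # closing delimiters we still expect, innermost last
    max_depth = 0
    for tok in tokens:
        if tok in OPENERS:
            stack.append(OPENERS[tok])
            max_depth = max(max_depth, len(stack))
        elif tok in CLOSERS:
            if not stack or stack.pop() != tok:
                raise ValueError(f"unbalanced delimiter: {tok!r}")
    if stack:
        raise ValueError("unclosed delimiter")
    return max_depth


def size_scale(tokens):
    """Evaluate log2(L) + D, the quantity inside the O(.) bound (constants ignored)."""
    if not tokens:
        raise ValueError("empty token sequence")
    return math.log2(len(tokens)) + nesting_depth(tokens)


# Toy C-like snippet, whitespace-tokenized for illustration only.
tokens = "fn f ( x ) { if ( x ) { g ( x ) } }".split()
print(len(tokens), nesting_depth(tokens), round(size_scale(tokens), 2))
```

The point of the sketch is that doubling the length of a program while keeping its nesting bounded adds only 1 to $\log_2 L + D$, which is why the bounded-depth assumption matters for the logarithmic scaling.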

Figures (7)

  • Figure 1: Programming language processing pipeline
  • Figure 2: Figures depicting the accuracy of the expected type (see the section on type checking) across different models, measured by their number of trainable parameters, when trained on various datasets. Training accuracies are better indicators of the expressive power of the models (as opposed to their generalizability) than evaluation accuracies. Evaluation accuracies are also reported in the appendix on additional experiments.
  • Figure 3: Transformation from $\phi_{\mathcal{T}}(x)$ to $\phi_{\mathcal{S}}(f(x))$ to $\phi_{\mathcal{R}}(g(f(x)))$ with MLP layers.
  • Figure 4: Figures for the dataset with $(f, a, c, d, v, e) = (10, 5, 5, 3, 0.2, 0.5)$.
  • Figure 5: Figures for the dataset with $(f, a, c, d, v, e) = (20, 5, 5, 3, 0.2, 0.5)$.
  • ...and 2 more figures

Theorems & Definitions (82)

  • Definition 1: code with Bounded AST-Depth
  • Theorem 1
  • Proof: Proof Sketch
  • Theorem 2
  • Proof: Proof Sketch
  • Definition 2: code with Bounded AST-Depth and Type-Inference-Depth
  • Theorem 3
  • Theorem 4
  • Definition 3: Tree
  • Definition 4: Recursive Definition of a Tree
  • ...and 72 more