Do Efficient Transformers Really Save Computation?

Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang

TL;DR

This paper aims to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer, and identifies a class of DP problems for which these models can be more efficient than the standard Transformer.

Abstract

As Transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become increasingly valuable. While many efficient Transformers and Transformer alternatives have been proposed, none provide theoretical guarantees that they are a suitable replacement for the standard Transformer. This makes it challenging to identify when to use a specific model and which directions to prioritize for further investigation. In this paper, we aim to understand the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer. We focus on their reasoning capability as exhibited by Chain-of-Thought (CoT) prompts and follow previous works in modeling these reasoning tasks as Dynamic Programming (DP) problems. Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size. Nonetheless, we identify a class of DP problems for which these models can be more efficient than the standard Transformer. We confirm our theoretical results through experiments on representative DP tasks, adding to the understanding of efficient Transformers' practical strengths and weaknesses.
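To make the DP framing concrete, here is a minimal Python sketch of one DP task, edit distance, that records every intermediate state in the order it is computed, loosely mirroring how a CoT sequence spells out intermediate results. The function name `edit_distance_trace`, the unit costs for insertion, deletion, and substitution, and the row-by-row ordering are all assumptions made for illustration; the paper's own task definition may differ.

```python
def edit_distance_trace(s: str, t: str) -> tuple[int, list[int]]:
    """Return the edit distance of s and t plus all DP states in computation order.

    Assumes unit cost for insert, delete, and substitute (illustrative only).
    """
    n, m = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    trace = []  # intermediate states, in the order they are filled in
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                dp[i][j] = j  # build t[:j] from the empty string
            elif j == 0:
                dp[i][j] = i  # delete all of s[:i]
            else:
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(
                    dp[i - 1][j] + 1,         # delete s[i-1]
                    dp[i][j - 1] + 1,         # insert t[j-1]
                    dp[i - 1][j - 1] + cost,  # match or substitute
                )
            trace.append(dp[i][j])
    return dp[n][m], trace


distance, states = edit_distance_trace("kitten", "sitting")
print(distance, len(states))
```

Each state depends on states computed earlier, some of them a full row back, so the number of earlier states a model may need to look up grows with the input length.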

Paper Structure

This paper contains 25 sections, 26 theorems, 23 equations, 3 figures, 2 tables.

Key Result

Theorem 4.1

Consider any DP problem defined above that satisfies the assumptions stated in the appendix (label appendix:proof_secDP). For any integer $n>0$, there exists a log-precision autoregressive Transformer with a constant depth $M$, a constant hidden dimension $D$, and a constant number of attention heads $H$ (independent of $n$)

Figures (3)

  • Figure 1: A comparison of accuracies across different tasks and model types. Each column corresponds to a task (Arithmetic, ED, LIS), and each row to a model (Standard Transformer, Linear Transformer, Sparse Transformer). Within each subplot, the x-axis represents the embedding dimension, and the y-axis denotes the problem size. The color intensities indicate the accuracy level achieved by the respective models. The figure demonstrates that efficient Transformers need larger hidden dimensions and that this requirement increases with problem size. It also highlights how standard Transformers can handle tasks across all difficulty levels with fixed embedding dimensions.
  • Figure 2: Accuracies of the Sparse Transformer and Linear Transformer on the ED_Local task, with problem size on the y-axis and embedding dimension on the x-axis. Both models benefit from locality.
  • Figure 3: A comparison of accuracies across model types on the ED task. Each subplot corresponds to a model (Standard Transformer, Linear Transformer, Sparse Transformer). Within each subplot, the x-axis represents the embedding dimension, and the y-axis denotes the problem size. The color intensities indicate the accuracy level achieved by the respective models.
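The locality effect mentioned for Figure 2 can be illustrated with a small Python sketch on the LIS (longest increasing subsequence) task. It assumes that "locality" means each DP state may depend only on states at most `m` steps back; that reading, the function name `lis_length`, and the `window` parameter are assumptions for illustration, not the paper's formal definition of $m$-locality DP.

```python
from typing import Optional


def lis_length(values: list[int], window: Optional[int] = None) -> int:
    """Length of the longest strictly increasing subsequence.

    If window is given, dp[i] may only extend subsequences ending at one of the
    previous `window` positions (a hypothetical locality-restricted variant).
    """
    if not values:
        return 0
    dp = [1] * len(values)  # dp[i] = best length of a subsequence ending at i
    for i in range(len(values)):
        # Earliest position that state i is allowed to read from.
        start = 0 if window is None else max(0, i - window)
        for j in range(start, i):
            if values[j] < values[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)


print(lis_length([1, 3, 2, 4]))            # unrestricted dependencies
print(lis_length([1, 3, 2, 4], window=1))  # each state sees one predecessor
```

With a window, each state reads a bounded number of earlier states regardless of sequence length, which is the kind of structure where restricted attention patterns could plausibly suffice.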

Theorems & Definitions (40)

  • Theorem 4.1: informal
  • Theorem 4.2
  • Definition 4.3
  • Theorem 4.4
  • Proposition 5.1: informal
  • Proposition 5.2: informal
  • Definition 5.3: $m$-locality DP
  • Proposition 5.4
  • Proposition 5.5
  • Lemma 1.1: From feng2023towards
  • ...and 30 more