ENTP: Encoder-only Next Token Prediction

Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

TL;DR

ENTP investigates encoder-only Transformers for next-token prediction as an alternative to standard decoder-only models, arguing that full attention with recomputation can yield greater expressive power under abundant compute. The work provides theoretical results showing encoder- and decoder-only architectures express different causal function classes, backed by the Count3 task where ENTP succeeds while decoders fail. Empirically, ENTP demonstrates superior or competitive performance on addition, in-context learning, and OpenWebText-style language modeling, though it remains computationally expensive. The findings suggest new directions for architecture design that balance expressiveness and efficiency, potentially guiding future compute-aware language model development.

Abstract

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks on which Transformers trained for next-token prediction can be evaluated, including addition, in-context learning, and language modeling.
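To make the contrast in the abstract concrete, here is a minimal sketch of the two ways of producing next-token states. It is not the authors' implementation: it assumes a single attention head with no learned projections (queries, keys, and values are the inputs themselves), no positional encoding, and no MLP or normalization, and all function names are hypothetical.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Scaled dot-product attention of one query over the given keys/values.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax(
        [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    )
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def causal_layer(xs):
    # Decoder-style layer: position i attends only to positions 0..i.
    return [attend(xs[i], xs[: i + 1], xs[: i + 1]) for i in range(len(xs))]


def full_layer(xs):
    # Encoder-style layer: every position attends to every position.
    return [attend(x, xs, xs) for x in xs]


def stack(layer, xs, num_layers):
    # Apply the same attention layer num_layers times.
    for _ in range(num_layers):
        xs = layer(xs)
    return xs


def decoder_predict_states(xs, num_layers=2):
    # One causal pass gives, at every position t, the state used to predict token t+1.
    return stack(causal_layer, xs, num_layers)


def entp_predict_states(xs, num_layers=2):
    # ENTP-style: for each prefix, rerun full attention from scratch
    # and keep only the state at the last position of that prefix.
    return [stack(full_layer, xs[: t + 1], num_layers)[-1] for t in range(len(xs))]
```

Both functions are causal as sequence-to-sequence maps, since the state at position t depends only on the first t+1 inputs. In this toy setting they coincide with one layer and generally diverge from two layers onward, because in the ENTP-style variant the intermediate states of earlier positions are recomputed with access to the whole current prefix.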

Paper Structure

This paper contains 51 sections, 4 theorems, 17 equations, 11 figures, 7 tables, 6 algorithms.

Key Result

Theorem 4.1

For any $L \ge 2$ and $D \ge 1$, there exists a position-free decoder $\widetilde{\mathcal{D}}$ that has $L$ layers and embedding dimension $D$, such that for any encoder $\mathcal{E}$, there exists some input sequence $(x_1, x_2, \ldots)$ with $x_1, x_2, \ldots \in \mathbb{R}^D$, and $\mathcal{T}\ldots$

Figures (11)

  • Figure 1: Decoder-only vs. Encoder-only Transformers in next token prediction. Decoders use causal attention, ensuring that each token attends only to the preceding tokens. In contrast, encoders allow all tokens to attend to each other by performing attention computation from scratch for each token prediction.
  • Figure 2: An example of a sequence used in a $\operatorname{Count3}$ experiment.
  • Figure 3: Training loss (left) and sequence accuracy (right) curves for the $\operatorname{Count3}$ task. ENTP learns the task, whereas decoder-only and prefix Transformers struggle to learn it.
  • Figure 4: Results of LLM fine-tuning on $\operatorname{Count3}$.
  • Figure 5: Addition Sample Complexity. The train and test datasets include numbers with up to 3 digits.
  • ...and 6 more figures
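
The caption of Figure 1 notes that the encoder performs attention "from scratch for each token prediction". A rough way to see what this costs is to count query-key dot products when generating n tokens. The sketch below is an illustration only: it assumes one layer and one head, counts attention scores and nothing else, and assumes the decoder reuses cached keys and values. The function names are hypothetical.

```python
def decoder_score_count(n):
    # With a KV cache, step t scores one new query against t cached keys.
    return sum(t for t in range(1, n + 1))


def entp_score_count(n):
    # Recomputing full attention at step t scores t queries against t keys.
    return sum(t * t for t in range(1, n + 1))


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, decoder_score_count(n), entp_score_count(n))
```

Under these assumptions the decoder total is n(n+1)/2, which grows quadratically in n, while the recomputing encoder total is n(n+1)(2n+1)/6, which grows cubically.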

Theorems & Definitions (12)

  • Theorem 4.1
  • Theorem 4.2
  • Conjecture 6.1
  • Lemma 6.2
  • proof
  • Lemma 6.3
  • Remark 6.4
  • proof
  • proof
  • proof
  • ...and 2 more