ENTP: Encoder-only Next Token Prediction
Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee
TL;DR
ENTP investigates encoder-only Transformers for next-token prediction as an alternative to standard decoder-only models, arguing that full attention with recomputation can yield greater expressive power when compute is abundant. The work provides theoretical results showing that encoder-only and decoder-only architectures express different classes of causal functions, illustrated by the Count3 task, which ENTP solves while decoder-only models fail. Empirically, ENTP demonstrates superior or competitive performance on addition, in-context learning, and OpenWebText-style language modeling, though it remains computationally expensive. The findings suggest new directions for architecture design that balance expressiveness and efficiency, potentially guiding future compute-aware language model development.
Abstract
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks on which next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.
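
The contrast between causal attention and encoder-style recomputation can be illustrated with a toy sketch. This is not the authors' implementation: it assumes single-head attention with identity projections, a residual connection, and no learned weights or positional encodings, and the function names (`attention`, `decoder_outputs`, `entp_outputs`) are hypothetical. The decoder-style path runs one causally masked pass over the whole sequence, while the ENTP-style path reruns full (unmasked) attention on every prefix and keeps only the last position.

```python
import math


def attention(x, causal):
    """Toy single-head self-attention with identity projections and a residual.

    x is a list of token vectors (lists of floats). With causal=True,
    position i attends only to positions 0..i; otherwise to all positions.
    """
    n = len(x)
    out = []
    for i in range(n):
        limit = i + 1 if causal else n
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(limit)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [
            sum(weights[j] * x[j][d] for j in range(limit)) / total
            for d in range(len(x[i]))
        ]
        out.append([xi + mi for xi, mi in zip(x[i], mixed)])  # residual add
    return out


def stack(x, n_layers, causal):
    """Apply the toy attention layer n_layers times."""
    for _ in range(n_layers):
        x = attention(x, causal)
    return x


def decoder_outputs(x, n_layers):
    """Decoder-style: one causally masked pass yields every position's output."""
    return stack(x, n_layers, causal=True)


def entp_outputs(x, n_layers):
    """ENTP-style: rerun full attention on each prefix and keep the last
    position. Nothing is reused between prefixes, so this costs one forward
    pass per position."""
    return [stack(x[: t + 1], n_layers, causal=False)[-1] for t in range(len(x))]


def max_abs_diff(a, b):
    """Largest elementwise gap between two lists of vectors."""
    return max(abs(p - q) for u, v in zip(a, b) for p, q in zip(u, v))


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print("1 layer :", max_abs_diff(decoder_outputs(tokens, 1), entp_outputs(tokens, 1)))
print("2 layers:", max_abs_diff(decoder_outputs(tokens, 2), entp_outputs(tokens, 2)))
```

In this toy setting the two schemes coincide for a single layer, because the last position of a prefix sees the same inputs either way. They diverge once layers are stacked: under full attention the intermediate representations of earlier tokens also depend on later tokens in the prefix. Both remain causal as functions of the input, since the output at position t never depends on tokens after t.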
