Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model
Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, Orhan Firat
TL;DR
The paper re-examines encoder-decoder versus decoder-only LLM architectures from a scaling perspective. It introduces RedLLM, an encoder-decoder model, and compares it with the decoder-only DecLLM at sizes from ~150M to ~8B parameters, using RedPajama V1 for pretraining and FLAN for instruction tuning. DecLLM is more compute-optimal during pretraining and shows stronger zero-/few-shot performance, while RedLLM demonstrates comparable scaling and context-length extrapolation. After instruction tuning, RedLLM achieves comparable or better downstream results with substantially better inference efficiency. These results challenge the notion that decoder-only architectures are categorically superior and show that encoder-decoder designs still offer compelling scaling properties and practical advantages in instruction-tuned deployment. The work argues for a broader view of LLM design, highlighting the complementary strengths of the two architectures and motivating further exploration of balanced or imbalanced architectures and longer-sequence capabilities.
Abstract
Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, has come without a rigorous comparative analysis, especially \textit{from the scaling perspective}, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit the encoder-decoder LLM (RedLLM), enhancing it with recent recipes from the decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales ranging from $\sim$150M to $\sim$8B parameters. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM exhibits compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable or even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings will inspire more efforts to re-examine RedLLM and unlock its potential for developing powerful and efficient LLMs.
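The abstract contrasts two pretraining objectives, causal LM for DecLLM and prefix LM for RedLLM. The sketch below illustrates, under simplifying assumptions, how one token sequence could be turned into a training example for each. The function names, the fixed `split` argument, and the toy token IDs are hypothetical and chosen for illustration; the paper's actual data pipeline (how the split point is chosen, special tokens, packing) is not described in this section.

```python
from typing import List, Tuple


def causal_lm_example(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Causal LM (decoder-only): predict each token from the tokens before it.

    Returns (inputs, targets), where targets are the inputs shifted by one.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return tokens[:-1], tokens[1:]


def prefix_lm_example(tokens: List[int], split: int) -> Tuple[List[int], List[int]]:
    """Prefix LM (encoder-decoder): condition on a prefix, predict the rest.

    Assumption for this sketch: the prefix is fed to the encoder and the
    decoder is trained to generate the continuation. Only the continuation
    tokens contribute to the loss.
    """
    if not 0 < split < len(tokens):
        raise ValueError("split must leave a non-empty prefix and continuation")
    encoder_input = tokens[:split]
    decoder_targets = tokens[split:]
    return encoder_input, decoder_targets


if __name__ == "__main__":
    toy = [1, 2, 3, 4, 5]  # hypothetical token IDs
    print(causal_lm_example(toy))     # ([1, 2, 3, 4], [2, 3, 4, 5])
    print(prefix_lm_example(toy, 2))  # ([1, 2], [3, 4, 5])
```

In this toy setup the causal objective supervises four of the five tokens, whereas the prefix objective supervises only the three continuation tokens. This only illustrates how the two objectives differ in form; it is not an account of the paper's compute-efficiency results.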
