Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model

Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, Orhan Firat

TL;DR

The paper re-examines encoder-decoder versus decoder-only LLM architectures from a scaling perspective, introducing RedLLM and comparing it with DecLLM across sizes from ~150M to ~8B parameters using RedPajama V1 pretraining and FLAN instruction tuning. It finds that DecLLM is more compute-efficient during pretraining and exhibits stronger zero-/few-shot performance, but RedLLM achieves comparable scaling and context-length extrapolation, and markedly better inference efficiency after instruction tuning. These results challenge the notion that decoder-only architectures are categorically superior and show that encoder-decoder designs still offer compelling scaling properties and practical advantages in instruction-tuned deployment. The work suggests a broader view of LLM design, highlighting complementary strengths and guiding future exploration of balanced or imbalanced architectures and longer sequence capabilities.

Abstract

Recent large language model (LLM) research has undergone an architectural shift from encoder-decoder modeling to the now-dominant decoder-only modeling. This rapid transition, however, has come without a rigorous comparative analysis, especially from the scaling perspective, raising concerns that the potential of encoder-decoder models may have been overlooked. To fill this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison between RedLLM, pretrained with prefix language modeling (LM), and DecLLM, pretrained with causal LM, at different model scales, ranging from ~150M to ~8B parameters. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for instruction tuning, our experiments show that RedLLM produces compelling scaling properties and surprisingly strong performance. While DecLLM is overall more compute-optimal during pretraining, RedLLM demonstrates comparable scaling and context length extrapolation capabilities. After instruction tuning, RedLLM achieves comparable and even better results on various downstream tasks while enjoying substantially better inference efficiency. We hope our findings will inspire more efforts on re-examining RedLLM, unlocking its potential for developing powerful and efficient LLMs.
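
The two pretraining objectives named in the abstract differ mainly in which tokens serve as context and which are predicted. The following is a minimal sketch of that difference, not the authors' data pipeline: the function names, the toy sentence, and the split point `prefix_len` are all hypothetical, and details such as start tokens or how the split point is sampled are left out because the abstract does not specify them.

```python
def causal_lm_example(tokens):
    """Causal LM: each token is predicted from all tokens before it."""
    inputs = tokens[:-1]   # model input at each step
    targets = tokens[1:]   # next-token targets, shifted by one
    return inputs, targets


def prefix_lm_example(tokens, prefix_len):
    """Prefix LM for an encoder-decoder model (illustrative only).

    The prefix is read as context (encoder input) and only the
    continuation is predicted (decoder targets).
    """
    if not 0 < prefix_len < len(tokens):
        raise ValueError("prefix_len must leave at least one token on each side")
    encoder_input = tokens[:prefix_len]
    decoder_targets = tokens[prefix_len:]
    return encoder_input, decoder_targets


if __name__ == "__main__":
    toy = ["the", "cat", "sat", "on", "the", "mat"]  # toy data, not from the paper
    print(causal_lm_example(toy))
    print(prefix_lm_example(toy, prefix_len=3))
```

In this sketch the causal objective yields a target for every position after the first, whereas the prefix objective yields targets only for the continuation, which is one reason the two setups need a shared evaluation protocol to be compared fairly.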

Paper Structure

This paper contains 19 sections, 2 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Overview of model architecture and specification for RedLLM and DecLLM. We use red, blue and gray to denote input tokens, output tokens, and positions, respectively. For RedLLM, we apply rotary embedding to all attentions (encoder/decoder self-attention and cross-attention) with continuous positions, i.e., the decoder positions continue from the last position in the encoder. We adopt prefix language modeling for pretraining, and apply layer normalization to query (Q), key (K), value (V), and attention output to improve training stability.
  • Figure 2: Fitted scaling law on in-domain dataset (RedPajama) for RedLLM and DecLLM. Left: training FLOPs ($C$); Right: model parameters ($N$). To ensure fair comparison, we evaluate PPL using a prefix LM approach, where we utilize the first 1024 tokens as a prefix and compute log-likelihood on the subsequent 1024 tokens.
  • Figure 3: PPL as a function of total training compute. Models are evaluated using the same prefix LM approach over different pretraining checkpoints. The compute-optimal frontier is mostly dominated by DecLLM, especially at larger compute budgets.
  • Figure 4: Zero- and few-shot pretraining performance over training steps.
  • Figure 5: PPL curves for length extrapolation on in-domain dataset (RedPajama). We follow the prefix LM evaluation approach and explore different prefix lengths (1, 512 and 1024).
  • ...and 11 more figures
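
The captions of Figures 1 and 2 describe two mechanical details that a short sketch may make concrete: continuous positions across encoder and decoder, and the prefix LM evaluation of PPL (first 1024 tokens as prefix, log-likelihood on the next 1024). The code below is a sketch under stated assumptions, not the authors' evaluation code. All function names are hypothetical, `uniform_scorer` is a stand-in for a real model, and reading "continues from the last one in the encoder" as "the first decoder position is `prefix_len`" is an interpretation of the caption.

```python
import math


def continuous_positions(prefix_len, target_len):
    """Position ids shared across encoder and decoder (assumed reading of Figure 1)."""
    encoder_pos = list(range(prefix_len))
    decoder_pos = list(range(prefix_len, prefix_len + target_len))
    return encoder_pos, decoder_pos


def split_for_prefix_eval(tokens, prefix_len=1024, target_len=1024):
    """Split a sequence into a context prefix and the tokens to be scored."""
    if len(tokens) < prefix_len + target_len:
        raise ValueError("sequence too short for the requested split")
    prefix = tokens[:prefix_len]
    targets = tokens[prefix_len:prefix_len + target_len]
    return prefix, targets


def prefix_lm_perplexity(target_log_probs):
    """PPL = exp of the mean negative log-likelihood over the scored tokens only."""
    if not target_log_probs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(target_log_probs) / len(target_log_probs)
    return math.exp(mean_nll)


def uniform_scorer(prefix, targets, vocab_size=4):
    """Hypothetical stand-in model: assigns probability 1/vocab_size to every target."""
    return [math.log(1.0 / vocab_size)] * len(targets)


if __name__ == "__main__":
    tokens = list(range(2048))  # toy token ids, not real data
    prefix, targets = split_for_prefix_eval(tokens)
    log_probs = uniform_scorer(prefix, targets)
    print(prefix_lm_perplexity(log_probs))  # close to 4.0 for the uniform scorer
```

Because only the continuation is scored, the same number of target tokens enters the average for both architectures, which is the sense in which this protocol makes the comparison fair. A uniform scorer over a vocabulary of size V gives a PPL of V, which serves as a quick sanity check on the averaging.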