Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation

Biao Zhang, Fedor Moiseev, Joshua Ainslie, Paul Suganthan, Min Ma, Surya Bhupatiraju, Fede Lebron, Orhan Firat, Armand Joulin, Zhe Dong

TL;DR

This work investigates adapting pretrained decoder-only LLMs into encoder-decoder models to achieve a better quality-efficiency trade-off. The authors propose architectural and optimization strategies, including cross-attention initialization, warmup schemes, and two pretraining objectives (PrefixLM and UL2), and evaluate them on Gemma 2 and mT5-sized models. They show that encoder-decoder adaptation yields comparable or better pretraining performance and substantially better finetuning performance, particularly after instruction tuning, under a similar inference budget. They also explore pairing models of different sizes (e.g., 9B-2B) and show that the adapted encoder representation gives better results on SuperGLUE. The authors plan to release checkpoints to support future research.

Abstract

While decoder-only large language models (LLMs) have shown impressive results, encoder-decoder models are still widely adopted in real-world applications for their inference efficiency and richer encoder representation. In this paper, we study a novel problem: adapting pretrained decoder-only LLMs to encoder-decoder, with the goal of leveraging the strengths of both approaches to achieve a more favorable quality-efficiency trade-off. We argue that adaptation not only enables inheriting the capability of decoder-only LLMs but also reduces the demand for computation compared to pretraining from scratch. We rigorously explore different pretraining objectives and parameter initialization/optimization techniques. Through extensive experiments based on Gemma 2 (2B and 9B) and a suite of newly pretrained mT5-sized models (up to 1.6B), we demonstrate the effectiveness of adaptation and the advantage of encoder-decoder LLMs. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance but substantially better finetuning performance than their decoder-only counterparts. For example, Gemma 2B-2B outperforms Gemma 2B by $\sim$7\% after instruction tuning. Encoder-decoder adaptation also allows for flexible combination of different-sized models, where Gemma 9B-2B significantly surpasses Gemma 2B-2B by $>$3\%. The adapted encoder representation also yields better results on SuperGLUE. We will release our checkpoints to facilitate future research.

Paper Structure

This paper contains 24 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of our approach. We build encoder-decoder models by adapting from pretrained decoder-only models. Model architecture and parameters are inherited from the decoder-only model except the cross-attention, for which we adopt different initialization methods depending on the encoder and decoder size. "ROPE": rotary embedding; "FFN": feed-forward layer.
  • Figure 2: Pretraining performance as a function of the number of pretrained tokens during the adaptation.
  • Figure 3: Comparison of decoder-only LLMs with adapted encoder-decoder models as a function of inference FLOPs. We show PT, IT, and SuperGLUE performance. Inference FLOPs are estimated with a sequence length of 4096-4096 and 8192 for encoder-decoder and decoder-only LLMs, respectively. Note that the upper left corner marks the quality-efficiency frontier.
  • Figure 4: GSM8K performance as a function of latency for RLHFed models. Latency is estimated as milliseconds (ms) per query by answering 200 reasoning questions from GSM8K. Batch size of 1 is used.
  • Figure 5: Quality change for the two-stage optimization. "UL2-then-PrefixLM": switch the training objective from UL2 to PrefixLM for the final 10% of tokens; "PrefixLM-then-UL2": the same, but switching from PrefixLM to UL2.
  • ...and 1 more figure
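
The adaptation described in the Figure 1 caption can be illustrated with a toy sketch. Everything here is hypothetical: the checkpoint layout (a dict with `d_model` and a list of `layers`), the function name `adapt_to_encoder_decoder`, and in particular the rule for cross-attention. The caption only says that the initialization depends on the encoder and decoder size; the sketch assumes that cross-attention is copied from the decoder's self-attention when the two sizes match and is randomly initialized otherwise. Weights are flat lists of floats rather than real tensors.

```python
import copy
import random


def adapt_to_encoder_decoder(enc_ckpt, dec_ckpt, seed=0):
    """Toy adaptation of decoder-only checkpoints into an encoder-decoder model.

    Each checkpoint is a hypothetical dict of the form
    {"d_model": int, "layers": [{"self_attn": [...], "ffn": [...]}, ...]}.
    """
    rng = random.Random(seed)

    # Encoder and decoder both inherit their parameters from decoder-only models.
    # Deep copies keep the source checkpoints untouched.
    encoder = copy.deepcopy(enc_ckpt["layers"])
    decoder = copy.deepcopy(dec_ckpt["layers"])

    same_size = enc_ckpt["d_model"] == dec_ckpt["d_model"]
    for layer in decoder:
        if same_size:
            # Assumption: with matching sizes, reuse the self-attention weights.
            layer["cross_attn"] = copy.deepcopy(layer["self_attn"])
        else:
            # Assumption: with mismatched sizes, start from small random values.
            # (A real model would also need a different weight shape here.)
            layer["cross_attn"] = [rng.gauss(0.0, 0.02) for _ in layer["self_attn"]]

    return {
        "encoder": encoder,
        "decoder": decoder,
        "cross_attn_init": "copied" if same_size else "random",
    }


if __name__ == "__main__":
    small = {"d_model": 4, "layers": [{"self_attn": [0.1, 0.2], "ffn": [0.3]}]}
    large = {"d_model": 8, "layers": [{"self_attn": [0.5, 0.6], "ffn": [0.7]}]}
    print(adapt_to_encoder_decoder(small, small)["cross_attn_init"])
    print(adapt_to_encoder_decoder(large, small)["cross_attn_init"])
```

In this sketch the cross-attention is the only component that is not inherited, which matches the caption; the warmup schemes and pretraining objectives mentioned in the TL;DR are not modeled.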