Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

TL;DR

This work addresses the high cost and latency of autoregressive decoding in large language models by proposing LLM-to-SLM, a hybrid framework in which a frozen LLM encoder processes the prompt once and the resulting representations condition a much smaller, trainable SLM that generates the response. A lightweight projector aligns the LLM features with the SLM embedding space, enabling fast autoregressive decoding while preserving much of the LLM's performance. Across translation, summarization, and instruction tuning, the approach achieves speedups of up to $4\times$, with performance drops of only $1-2\%$ on translation and summarization relative to the LLM, and it works with both encoder-decoder and decoder-only SLMs. These gains make the approach suitable for latency-sensitive applications; combining it with other efficiency techniques and extending it to larger, decoder-only LLMs are left to future work.

Abstract

Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.
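The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "LLM encoder" is a frozen embedding table, the projector is a single linear map, and the "SLM" is a mean-pooling stand-in, all with hypothetical sizes chosen for illustration. The only point it tries to show is the call pattern: the expensive encoder runs once per prompt, while the cheap decoder runs once per generated token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not taken from the paper).
D_LLM, D_SLM, VOCAB = 16, 8, 32

# Stand-in for the frozen LLM encoder: a fixed table that is never updated.
llm_embed = rng.normal(size=(VOCAB, D_LLM))

# Stand-ins for the trainable parts: projector and a tiny "SLM".
proj_w = rng.normal(size=(D_LLM, D_SLM)) / np.sqrt(D_LLM)
slm_embed = rng.normal(size=(VOCAB, D_SLM))
slm_out = rng.normal(size=(D_SLM, VOCAB))

calls = {"llm": 0, "slm": 0}  # count how often each model is invoked


def llm_encode(prompt_ids):
    """Encode all prompt tokens in one parallel pass."""
    calls["llm"] += 1
    return llm_embed[prompt_ids]  # shape: (prompt_len, D_LLM)


def project(llm_repr):
    """Map LLM features into the SLM embedding space."""
    return llm_repr @ proj_w  # shape: (prompt_len, D_SLM)


def slm_step(cond, generated_ids):
    """One greedy decoding step conditioned on the projected prompt features."""
    calls["slm"] += 1
    context = cond.mean(axis=0)
    if generated_ids:
        context = context + slm_embed[generated_ids].mean(axis=0)
    logits = context @ slm_out  # shape: (VOCAB,)
    return int(np.argmax(logits))


def generate(prompt_ids, max_new_tokens):
    cond = project(llm_encode(prompt_ids))  # large model: once per prompt
    out = []
    for _ in range(max_new_tokens):  # small model: once per new token
        out.append(slm_step(cond, out))
    return out


tokens = generate([3, 7, 11], max_new_tokens=5)
```

After the call, `calls` records one LLM invocation and five SLM invocations. In the actual method only the SLM side is fine-tuned; how the projected features enter the SLM is shown in Figure 2 and is not reproduced by this sketch.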

Paper Structure

This paper contains 38 sections, 2 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: LLM-to-SLM: A large language model (LLM) computes a high-quality representation of the prompt to condition a small language model (SLM), which then efficiently decodes the response while maintaining high performance close to the LLM.
  • Figure 2: Architecture details. A frozen LLM encoder integrates projected representations into either a trainable encoder-decoder or a decoder-only SLM.
  • Figure 3: Performance-runtime trade-off curves for various models across different tasks.
  • Figure 4: Performance-runtime comparison with tiny GPT2 versions ($d$ indicates maximum depth) as SLMs for machine translation. The y-axis shows the average BLEU score across languages. T5 Large$\,\to\,$GPT2 with only 4 layers outperforms GPT2. The smaller the SLM, the greater the gap to our LLM-to-SLM models.
  • Figure 5: Runtime for LLM, SLM and LLM $\rightarrow$ SLM with varying generation lengths.
  • ...and 1 more figure
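The runtime comparison across generation lengths (Figure 5) can be reasoned about with a back-of-the-envelope cost model. The sketch below assumes that total time is one prompt-encoding pass plus a fixed cost per generated token, and it ignores the projector and any SLM-side prompt processing. All millisecond values are hypothetical placeholders, not measurements from the paper.

```python
def decode_time(encode_cost, step_cost, new_tokens):
    """Total time: one prompt-encoding pass plus one step per generated token."""
    return encode_cost + step_cost * new_tokens


# Hypothetical per-call costs in milliseconds (assumptions, not measurements).
LLM_ENCODE_MS = 40.0  # one parallel LLM pass over the prompt
LLM_STEP_MS = 20.0    # one LLM decoding step
SLM_STEP_MS = 4.0     # one SLM decoding step


def speedup(new_tokens):
    """Ratio of LLM-only runtime to hybrid (LLM encodes, SLM decodes) runtime."""
    llm_only = decode_time(LLM_ENCODE_MS, LLM_STEP_MS, new_tokens)
    hybrid = decode_time(LLM_ENCODE_MS, SLM_STEP_MS, new_tokens)
    return llm_only / hybrid


for n in (1, 10, 100, 1000):
    print(n, round(speedup(n), 2))
```

Under these assumptions the speedup grows with generation length and approaches the ratio of per-step costs (here `LLM_STEP_MS / SLM_STEP_MS`), because the one-off encoding cost is amortized over more generated tokens.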