Mixture of Attentions For Speculative Decoding

Matthieu Zimmer, Milan Gritta, Gerasimos Lampouras, Haitham Bou Ammar, Jun Wang

TL;DR

Speculative decoding accelerates LLM generation by drafting tokens with a smaller model and verifying them with a larger one, but existing draft models suffer from partial observability and from training that is not on-policy. The paper proposes a Mixture of Attentions that combines Layer Self-Attention, Cross-Attention, and Target Layer Inference to produce more on-policy, information-rich drafts that make effective use of the large model's activations. On a single device it reports state-of-the-art speedups, improving on EAGLE-2 by 9.5% in speed and by 25% in acceptance length. It also demonstrates a viable client-server deployment that needs few server calls and keeps generating after a disconnection. The approach offers a practical framework for edge LLM serving, with client-side drafting (which the summary describes as privacy-preserving) and server-assisted verification.

Abstract

The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server for different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.

Paper Structure

This paper contains 35 sections, 5 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: A schematic overview of the mixture of attentions information flow. Layer Self-Attention and mean aggregation are called only once per drafting cycle, i.e., after each verification. New tokens are drafted auto-regressively using Self-Attention, updating only the Cross-Attention layer query.
  • Figure 2: Layer Self-Attention: $\mathcal{M}_{\text{Large}}$ activations are transposed so that attention is computed over the layer dimension in order to aggregate token activations across layers. Self-Attention: The first 3 tokens represent the prompt; speculative decoding starts at token 4. Cross-Attention: Tokens 4 to 7 attend only to the prompt, while tokens 8 to 10 attend to the first 7 tokens once $\mathcal{M}_{\text{Large}}$ has been called a second time, allowing $\mathcal{M}_{\text{Small}}$ to use the activations from the newly verified tokens.
  • Figure 3: A client-server setting for our mixture of attentions architecture with $N=0$.
  • Figure 4: vLLM inference with continuous batching.
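
The captions for Figures 1 and 2 describe two mechanisms that a small example may make easier to picture: attention computed over the layer dimension followed by mean aggregation, and the Cross-Attention visibility pattern in which drafted tokens only see large-model activations of already verified tokens. The sketch below is illustrative only and is not the authors' implementation. The function names `layer_self_attention_mean` and `cross_attention_mask` are hypothetical, the attention uses a single head with no learned query/key/value projections (a real layer would normally have them), and the token numbering simply follows the 10-token example in the Figure 2 caption.

```python
import math


def layer_self_attention_mean(acts):
    """Toy attention over the layer dimension, then a mean over layers.

    acts[l][t] is the activation vector of layer l for token t.
    Assumption: identity projections and a single head, for illustration.
    Returns one aggregated vector per token.
    """
    num_layers, num_tokens = len(acts), len(acts[0])
    dim = len(acts[0][0])
    aggregated = []
    for t in range(num_tokens):
        # "Transpose": for a fixed token, the sequence runs over layers.
        layers = [acts[l][t] for l in range(num_layers)]
        attended = []
        for query in layers:
            scores = [
                sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                for key in layers
            ]
            top = max(scores)  # subtract the max for a stable softmax
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            attended.append([
                sum(weights[j] / total * layers[j][d] for j in range(num_layers))
                for d in range(dim)
            ])
        # Mean aggregation across layers gives one vector per token.
        aggregated.append([
            sum(vec[d] for vec in attended) / num_layers for d in range(dim)
        ])
    return aggregated


def cross_attention_mask(cycles, num_tokens):
    """Boolean mask[q][k]: may drafted token q+1 attend to activations of token k+1?

    cycles is a list of (first_drafted, last_drafted, verified_count),
    using 1-indexed token positions as in the Figure 2 caption.
    """
    mask = [[False] * num_tokens for _ in range(num_tokens)]
    for first, last, verified in cycles:
        for q in range(first, last + 1):
            for k in range(1, verified + 1):
                mask[q - 1][k - 1] = True
    return mask


# Figure 2 example: the prompt is tokens 1-3; tokens 4-7 are drafted after the
# first large-model call, tokens 8-10 after the second call verified tokens 4-7.
mask = cross_attention_mask([(4, 7, 3), (8, 10, 7)], num_tokens=10)
for q in range(4, 11):
    visible = [k + 1 for k, allowed in enumerate(mask[q - 1]) if allowed]
    print(f"token {q} attends to large-model activations of tokens {visible}")
```

In this toy version the rows for the prompt tokens stay empty because only drafted tokens issue Cross-Attention queries. The set of visible keys grows only when a new verification step makes more large-model activations available, which matches the caption's statement that only the Cross-Attention query is updated while drafting.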