Cross-Attention Speculative Decoding

Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Yipeng Ji, Chul Lee

TL;DR

This work targets the efficiency of speculative decoding (SD) for large language models by addressing the architectural complexity and training cost of state-of-the-art SD methods. It introduces Budget EAGLE (Beagle), a cross-attention-based SD model that forgoes pooling and auxiliary layers in favor of a single cross-attention block. Beagle is paired with a Two-Stage Block-Attention Training regimen: the early stage encourages multi-token representations via inverse block masking, and the late stage simulates SD inference so that training matches inference dynamics. Empirical results across multiple 7B-scale models show Beagle achieving competitive inference speedups with significantly lower extra memory than EAGLE-v2, aided by stable memory usage during training-time simulation and improved training efficiency from the early multi-token stage. Overall, Beagle provides a simpler, more generalizable SD alternative with practical efficiency gains, opening avenues for applying cross-attention speculative decoding to additional domains and model families.

Abstract

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
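To make the draft-then-verify idea behind speculative decoding concrete, here is a minimal greedy sketch. It is not Beagle or EAGLE: the function `speculative_step` and the toy models `toy_draft_next` and `toy_target_next` are hypothetical names introduced only for illustration, verification is done position by position rather than in one batched target pass, and tree-structured drafting is omitted.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: Sequence[int],
    draft_next: NextToken,
    target_next: NextToken,
    num_draft_tokens: int,
) -> List[int]:
    """Run one greedy draft-then-verify step and return the committed tokens."""
    # 1) The cheap draft model proposes tokens autoregressively.
    context = list(prefix)
    proposal: List[int] = []
    for _ in range(num_draft_tokens):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2) The target model checks each proposed token in order.
    #    (A real system would score all positions in a single forward pass.)
    committed: List[int] = []
    context = list(prefix)
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            committed.append(expected)
            return committed
        committed.append(token)
        context.append(token)

    # Every draft token was accepted, so the target adds one bonus token.
    committed.append(target_next(context))
    return committed


def toy_target_next(context: Sequence[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_next(context: Sequence[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


if __name__ == "__main__":
    # The draft proposes [2, 3, 0, 1]; the target rejects the third token.
    print(speculative_step([1], toy_draft_next, toy_target_next, 4))  # [2, 3, 4]
```

In this toy run, two draft tokens are accepted and the target supplies the third, so three tokens are committed for one verification step. The better aligned the draft is with the target, the more tokens each step commits.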

Paper Structure

This paper contains 18 sections, 12 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Comparison between EAGLE [li2025eagle1; li2024eagle2] (left) and our Beagle architecture (right). Square boxes denote higher-level states; a hat on top indicates states predicted by the draft model. Embedding layers are omitted for clarity, and colored words represent tokens generated at different positions. The trees on the right represent branched prediction via tree attention [Miao_specInfer; cai2024medusa]. Because it uses self-attention, EAGLE requires auxiliary pooling layers and explicit copying of higher-level states for concatenation. In contrast, Beagle adopts a standard training pipeline without offsets and avoids copying, simplifying draft modeling.
  • Figure 2: The cross-attention masks used during training for draft model heads. Left (early-stage block attention): Query states are derived directly from token embeddings, and keys come from high-level states. An inverse block attention mask, starting at a random offset and using a fixed window, hides local keys from a query, encouraging the model to condense more information about future tokens. Right two panels (late stage, after simulation steps 1 and 2): In late-stage training, we unroll newly predicted states to accurately simulate inference during training. Unlike Training-Time Test with self-attention, this method generates no new queries and needs only a one-step attention memory allocation to add the next predicted keys in place.
  • Figure 3: Early-stage acceptance rates at different draft steps (step-$\alpha$) during the first 10 training epochs (evaluated on MT-Bench). Our model (Beagle) uses the early-stage loss based on multi-token predictions. At this stage, its training efficiency is consistently higher than that of EAGLE (v1/v2) [li2025eagle1; li2024eagle2].
  • Figure 4: Early-stage token acceptance rates at different positions and the corresponding inference speeds (evaluated on MT-Bench). We vary the window length from 1 to 5, giving five early-stage training settings. Multi-token prediction (using $\mathcal{L}_{early}$) with a suitable window width (best at 3) improves token acceptance rates at later draft steps, which generally raises inference speed.
  • Figure 5: Changes in draft-model prediction accuracy during the late stage (epochs 10 to 20) under different training losses (the validation set used during training is a subset of MT-Bench). The orange lines correspond to the model trained with our proposed late-stage loss $\mathcal{L}_{late}$, and the blue baselines correspond to continuing training with the early-stage loss $\mathcal{L}_{early}$. Because accuracy changes have high variance during late-stage training, we also highlight smoothed, interpolated curves.
  • ...and 6 more figures
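
The early-stage mask described in the Figure 2 caption can be sketched roughly as follows. This is one possible reading of the caption, not the authors' implementation: the function name `inverse_block_mask`, the choice to align blocks at `offset`, and the rule that a query sees keys only up to the start of its own block are all assumptions made for illustration. Under this reading, a window of 1 reduces to an ordinary causal mask, and wider windows force later queries in a block to predict further ahead from older keys.

```python
import random
from typing import List, Optional


def inverse_block_mask(
    seq_len: int,
    window: int,
    offset: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[List[bool]]:
    """Return mask[q][k], True where query q may attend to key k.

    Assumed convention (hypothetical): positions are split into blocks of
    `window` tokens aligned at `offset`. A query sees keys up to and
    including the first position of its block; later keys inside the
    block are hidden.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    if offset is None:
        # Random block alignment, drawn once per call.
        offset = (rng or random.Random()).randrange(window)

    mask: List[List[bool]] = []
    for q in range(seq_len):
        # First position of the block that contains query q.
        block_start = q - ((q - offset) % window)
        # Clamp so every query keeps at least key 0 visible.
        last_visible = max(block_start, 0)
        mask.append([k <= last_visible for k in range(seq_len)])
    return mask


if __name__ == "__main__":
    for row in inverse_block_mask(seq_len=6, window=3, offset=0):
        print("".join("x" if visible else "." for visible in row))
```

With `window=3` and `offset=0`, queries 0 to 2 see only key 0 and queries 3 to 5 see keys 0 to 3, so the queries later in each block have their nearest keys hidden.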