Cross-Attention Speculative Decoding

Wei Zhong; Manasa Bharadwaj; Yixiao Wang; Yipeng Ji; Chul Lee

TL;DR

This work targets the efficiency of speculative decoding (SD) for large language models by addressing the architectural complexity and training cost of state-of-the-art SD methods. It introduces Budget EAGLE (Beagle), a cross-attention-based SD model that replaces pooling and auxiliary layers with a single cross-attention block. Beagle is trained with Two-Stage Block-Attention Training: the first stage encourages multi-token representations via inverse block masking, and the second simulates SD inference so that training matches inference dynamics. Across multiple 7B-scale models, Beagle achieves inference speedups competitive with EAGLE-v2 while using significantly less extra memory, with stable memory usage during training-time simulation and improved training efficiency from the early multi-token stage. Overall, Beagle offers a simpler, more generalizable SD alternative with practical efficiency gains, and opens avenues for applying cross-attention speculative decoding to additional domains and model families.

Abstract

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), which is, to our knowledge, the first cross-attention-based Transformer decoder SD model to achieve performance on par with leading self-attention SD models (EAGLE-v2). Beagle eliminates the need for pooling or auxiliary components, which simplifies the architecture, improves training efficiency, and keeps memory usage stable during training-time simulation. To train this architecture effectively, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, making it a strong architectural alternative for speculative decoding.
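For readers new to the draft-then-verify idea that the abstract builds on, the following is a minimal sketch of a generic greedy speculative decoding loop. It is not Beagle's implementation: the function name `speculative_decode`, the `NextTokenFn` interface, and the `draft_len` parameter are hypothetical, the models are stand-in callables, and verification uses exact greedy matching rather than the sampling-based or tree-based schemes used by systems such as EAGLE-v2.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: given the tokens so far, return the greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[Token]:
    """Greedy draft-then-verify loop.

    The result is identical to greedy decoding with target_next alone;
    the draft model only determines how many tokens are accepted per round.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) Draft: the cheap model proposes up to draft_len tokens.
        k = min(draft_len, goal - len(tokens))
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) Verify: compare each proposal with the target's own choice.
        #    A real system scores all drafted positions in one batched
        #    forward pass; the per-position calls here are for clarity only.
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's token
            if expected != proposed:
                break  # first mismatch: discard the remaining draft tokens
        tokens.extend(accepted)
    return tokens
```

Each round appends at least one target-approved token, so the loop always terminates. The better the draft model agrees with the target, the more tokens are accepted per round, which is where the speedup comes from.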
