LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang

TL;DR

LycheeDecode tackles the memory and latency bottlenecks of long-context LLMs by shifting from layer-wide token sharing to a fine-grained, head-level hybrid attention scheme. It designates a small set of Retrieval Heads to perform full attention and identify critical tokens, while Sparse Heads reuse these tokens for efficient computation, enabled by a differentiable HardKuma-based head specialization and a distillation-plus-sparsity loss. The approach achieves strong generative quality on long-context and reasoning benchmarks, with end-to-end speedups up to 2.7× at 128K context and substantial kernel-level acceleration via a TileLang kernel. By preserving attention head diversity and enabling cooperative token sharing across layers, LycheeDecode provides a practical pathway to scalable, high-quality long-context LLM inference.
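The HardKuma-based head specialization mentioned above can be illustrated with a small sampling sketch. It uses the standard HardKuma construction: a Kumaraswamy sample is stretched to an interval slightly wider than [0, 1] and then clipped, so exact 0s and 1s occur with nonzero probability. The stretch bounds `lo` and `hi`, the function names, and the convention that a gate of 1 means "full attention" are illustrative assumptions, not details taken from the paper.

```python
import random

def sample_hardkuma(a, b, lo=-0.1, hi=1.1, rng=random):
    """Draw one stretched-and-rectified Kumaraswamy(a, b) sample in [0, 1].

    lo and hi are assumed stretch bounds; the paper's values may differ.
    """
    u = rng.random()                                  # uniform in [0, 1)
    kuma = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # inverse CDF of Kumaraswamy
    stretched = lo + (hi - lo) * kuma                 # stretch to (lo, hi)
    return min(1.0, max(0.0, stretched))              # clip back to [0, 1]

def mix_head_outputs(z, full_out, sparse_out):
    """Blend a head's full-attention and sparse-attention outputs by gate z.

    Assumption: z = 1 selects full attention (retrieval behavior),
    z = 0 selects sparse attention.
    """
    return [z * f + (1.0 - z) * s for f, s in zip(full_out, sparse_out)]
```

Because the clipped sample can be exactly 0 or 1, a head trained with such a gate can settle into a discrete role while the gate parameters `a` and `b` still receive gradients in a real autodiff framework; this plain-Python version only shows the sampling arithmetic.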

Abstract

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, that of the full-attention baseline. Crucially, this is accomplished with up to a 2.7× speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
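The division of labor between retrieval heads and sparse heads can be sketched for a single decoding step as follows. This is a minimal illustration, not the paper's implementation: it assumes that a token set selected by a retrieval head is inherited by the head with the same index in later layers, and that a sparse head with no inherited set falls back to full attention. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values, idx):
    """Scaled dot-product attention of query q over the cache positions in idx."""
    weights = softmax([dot(q, keys[i]) / math.sqrt(len(q)) for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]

def top_k_positions(q, keys, k):
    """Positions of the k keys with the highest attention logits for q."""
    scores = [dot(q, key) for key in keys]
    return sorted(sorted(range(len(keys)), key=lambda i: -scores[i])[:k])

def hybrid_decode_step(queries, keys, values, is_retrieval, k):
    """One decoding step; queries[l][h], keys[l][h], values[l][h] are per layer and head.

    Retrieval heads attend to the whole cache and refresh the critical-token set.
    Sparse heads attend only to the set inherited from earlier layers (assumed
    to be propagated along the same head index).
    """
    num_heads = len(queries[0])
    critical = [None] * num_heads
    outputs = []
    for l, layer_queries in enumerate(queries):
        layer_out = []
        for h, q in enumerate(layer_queries):
            K, V = keys[l][h], values[l][h]
            if is_retrieval[l][h] or critical[h] is None:
                critical[h] = top_k_positions(q, K, k)  # identify crucial tokens
                idx = range(len(K))                     # full attention
            else:
                idx = critical[h]                       # reuse the inherited tokens
            layer_out.append(attend(q, K, V, idx))
        outputs.append(layer_out)
    return outputs, critical
```

In this sketch a sparse head touches only `k` cache entries instead of the whole cache, which is where the memory-traffic saving comes from; when `k` equals the cache length, the sparse path reduces to full attention.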

Paper Structure

This paper contains 44 sections, 20 equations, 15 figures, 5 tables, 2 algorithms.

Figures (15)

  • Figure 1: LycheeDecode achieves the best performance and the lowest latency. Left: Relative performance of various methods on the Qwen3-8B model across different LongBench datasets. Right: Decoding latency across different context lengths at a batch size of 1.
  • Figure 2: Overlap rate of top-$k$ ($k=5$) attention scores between corresponding heads in adjacent layers. The heatmap illustrates the functional diversity among attention heads. The input prompt is "Please directly output the final answer based on the given question. Question: In a world containing only squares, circles, and triangles, one shape is defined by having no angles and being perfectly symmetrical from every point on its perimeter. What is the single name of the only shape that fits this description? Answer:", and Llama-3 outputs "circle". More cases can be found in Appendix \ref{sec:more_case}.
  • Figure 3: Overall framework. Left: During training, each head computes both full attention and sparse attention, and the two outputs are weighted by values sampled from HardKuma. Right: During inference, retrieval heads compute the set of critical tokens, which subsequent sparse heads reuse for efficient computation.
  • Figure 4: End-to-end decoding latency (TPOT) across various context lengths. LycheeDecode and TidalDecode use a fixed token budget of 4096. Note that TidalDecode supports only a batch size of 1.
  • Figure 5: Latency comparison of our hybrid head kernel and the FlashAttention-2 kernel across different sparse head ratios, context lengths, and batch sizes.
  • ...and 10 more figures
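
The overlap rate described in the Figure 2 caption can be computed along the following lines. This is a minimal sketch under stated assumptions: `attn[l][h]` is taken to be the list of attention scores that head `h` in layer `l` assigns to the cached tokens for one query, and the overlap rate is taken to be the size of the intersection of the two top-$k$ sets divided by $k$. The function names are hypothetical.

```python
def top_k_set(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def overlap_rate(scores_a, scores_b, k):
    """Fraction of top-k positions shared by two score vectors."""
    return len(top_k_set(scores_a, k) & top_k_set(scores_b, k)) / k

def adjacent_layer_overlap(attn, k):
    """Overlap between head h in layer l and head h in layer l + 1.

    Returns a (num_layers - 1) x num_heads matrix, the kind of data a
    heatmap such as Figure 2 would display.
    """
    return [
        [overlap_rate(attn[l][h], attn[l + 1][h], k) for h in range(len(attn[l]))]
        for l in range(len(attn) - 1)
    ]

# Toy scores: 2 layers, 2 heads, 6 cached tokens.
attn = [
    [[0.50, 0.10, 0.30, 0.05, 0.03, 0.02], [0.05, 0.60, 0.20, 0.10, 0.03, 0.02]],
    [[0.40, 0.10, 0.35, 0.08, 0.04, 0.03], [0.02, 0.50, 0.10, 0.05, 0.30, 0.03]],
]
print(adjacent_layer_overlap(attn, k=2))  # [[1.0, 0.5]]
```

A head whose top-$k$ set stays stable across adjacent layers (overlap near 1) is a natural candidate for reusing inherited tokens, while a head with low overlap needs to select its own.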