LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang

TL;DR

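LycheeCluster manages the KV cache for long-context inference by preserving local semantic coherence through boundary-aware chunking and by building a recursive hierarchical index rooted in the triangle inequality. The index turns cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, and a lazy update strategy supports efficient streaming generation.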

Abstract

The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.
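The abstract does not spell out the index layout, so the following is only a minimal sketch of how a triangle-inequality bound can prune a hierarchical search. It assumes Euclidean distance, a covering radius per node, and a nearest-chunk query; the names `Node`, `build_node`, and `nearest_chunk` are hypothetical and not taken from the paper, whose scoring function and clustering procedure may differ.

```python
import math


class Node:
    """Hypothetical index node: a centroid plus a radius covering its subtree."""

    def __init__(self, centroid, radius=0.0, children=None, chunk_id=None):
        self.centroid = centroid
        self.radius = radius
        self.children = children or []
        self.chunk_id = chunk_id  # set only on leaves (one leaf per chunk)


def build_node(children):
    """Group child nodes under a parent whose ball contains every child ball."""
    dim = len(children[0].centroid)
    centroid = [sum(c.centroid[i] for c in children) / len(children) for i in range(dim)]
    radius = max(math.dist(centroid, c.centroid) + c.radius for c in children)
    return Node(centroid, radius, children=children)


def nearest_chunk(root, query):
    """Top-down search that skips subtrees that cannot beat the best chunk so far."""
    best_dist, best_id = math.inf, None
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        # Triangle inequality: d(q, x) >= d(q, c) - r for every x inside the ball.
        lower_bound = max(0.0, math.dist(query, node.centroid) - node.radius)
        if lower_bound >= best_dist:
            continue  # prune: nothing in this subtree can be closer
        visited += 1
        if node.chunk_id is not None:
            best_dist, best_id = math.dist(query, node.centroid), node.chunk_id
        else:
            # Push farthest first so the closest child is popped (explored) first.
            stack.extend(sorted(node.children,
                                key=lambda c: math.dist(query, c.centroid),
                                reverse=True))
    return best_id, visited


# Toy two-level index over four chunk representatives (assumed 2-D for clarity).
leaves = [Node([0.0, 0.0], chunk_id=0), Node([0.0, 1.0], chunk_id=1),
          Node([10.0, 10.0], chunk_id=2), Node([10.0, 11.0], chunk_id=3)]
root = build_node([build_node(leaves[:2]), build_node(leaves[2:])])
chunk, visited = nearest_chunk(root, [0.1, 0.2])
```

In this toy index the far cluster is rejected from its bound alone, so fewer than the seven nodes in the tree are expanded. The sketch illustrates the pruning idea only and makes no claim about the logarithmic bound stated in the abstract.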

Paper Structure (47 sections, 2 equations, 12 figures, 6 tables, 1 algorithm)

Figures (12)

  • Figure 1: Impact of retrieval granularity on semantic integrity. We illustrate the limitations of existing methods using a JSON retrieval query. (Left) Quest employs fixed-size pages that arbitrarily sever semantic boundaries. (Middle) ClusterKV uses token-level clustering that scatters locally coherent tokens into disjoint clusters based on vector variance. (Right) LycheeCluster (Ours) preserves the complete semantic unit via structure-aware chunking, ensuring precise retrieval.
  • Figure 2: Pilot Study on StrucText-Eval. We compare the standard Quest (fixed page) with a modified version using structure-aware chunks while keeping the scoring metric identical. The significant accuracy gain (e.g., +15.0% on JSON) confirms that preserving semantic integrity is a prerequisite for effective retrieval.
  • Figure 3: The overall pipeline of LycheeCluster. The left panel illustrates the bottom-up index construction during the prefill phase, where variable-length chunks are hierarchically clustered. The right panel demonstrates the top-down retrieval and incremental update during the decoding phase.
  • Figure 4: End-to-end decoding latency (TPOT) comparison on a single H20 GPU across varying context lengths. While full attention exhibits linear latency growth, our method maintains consistently low latency.
  • Figure 5: Kernel-level latency breakdown. (a) Prefill Phase: Latency comparison across varying context lengths. The colored top sections represent the index construction overhead. While LycheeCluster incurs a slightly higher construction cost (10–15%) than ClusterKV, it remains a minor fraction of the total prefill time. (b) Decoding Phase: Breakdown of total latency for generating 1,024 tokens at 72k context. The combined overhead of retrieval and lazy updates in LycheeCluster is minimal compared to the massive computation reduction.
  • ...and 7 more figures
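
The contrast drawn in Figure 1 between fixed-size pages and structure-aware chunks can be illustrated with a small sketch. It assumes a pre-tokenized JSON snippet, a hand-picked set of boundary tokens, and a maximum chunk length; `fixed_pages`, `boundary_chunks`, and the delimiter set are hypothetical choices for illustration and are not the paper's chunking rules.

```python
def fixed_pages(tokens, page_size):
    """Split tokens into fixed-size pages, ignoring any structure."""
    return [tokens[i:i + page_size] for i in range(0, len(tokens), page_size)]


def boundary_chunks(tokens, boundaries, max_len):
    """Close a chunk right after a boundary token, or once it reaches max_len."""
    chunks, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in boundaries or len(current) >= max_len:
            chunks.append(current)
            current = []
    if current:  # flush the trailing partial chunk
        chunks.append(current)
    return chunks


tokens = ['{', '"id"', ':', '1', ',',
          '"name"', ':', '"lychee"', ',',
          '"tag"', ':', '"kv"', '}']

pages = fixed_pages(tokens, page_size=3)
chunks = boundary_chunks(tokens, boundaries={',', '}'}, max_len=8)
```

With these inputs the second fixed page mixes the end of the `"id"` field with the start of the `"name"` field, while each boundary-aware chunk holds one complete key-value pair. The `max_len` cap is an assumed safeguard against unbounded chunks when no boundary token appears.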