MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim
TL;DR
MoSKA tackles the KV cache bandwidth bottleneck in long-context LLM inference by separating per-request unique KV data from massively shared KV data and using Shared KV Attention to turn attention over the shared data into compute-bound GEMMs. It combines MoE-inspired sparse routing, which selects the relevant shared chunks, with a disaggregated hardware infrastructure that scales shared and unique workloads independently. On an FP8 Llama 3.1 8B setup with $75\%$ sparsity and large shared contexts ($1\text{M}$ to $16\text{M}$ tokens) alongside $64\text{K}$ unique tokens, MoSKA achieves up to $538.7\times$ the throughput of baselines. The Universal MoSKA vision, supported by position-independent KV caching (EPIC2410), aims to enable dynamic, composable knowledge contexts from a modular KV library for scalable, context-rich AI systems.
Abstract
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
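To make the GEMV-to-GEMM argument concrete, the following is a minimal back-of-the-envelope sketch rather than the authors' model. It estimates the arithmetic intensity (FLOPs per byte of key-cache traffic) of the query-key score computation over a shared key cache. The function name, its parameters, and the simplification that only key reads count as memory traffic (query and output traffic are ignored) are assumptions made for illustration.

```python
def attention_score_intensity(num_requests, shared_tokens, head_dim, bytes_per_elem=1):
    """Return (gemv_intensity, gemm_intensity) in FLOPs per byte of key traffic.

    Hypothetical helper: models only the Q.K^T scores over a shared key cache
    and counts only key reads as memory traffic (an illustrative assumption).
    bytes_per_elem=1 corresponds to FP8 storage.
    """
    # Each query-key dot product costs 2 * head_dim FLOPs (multiply and add).
    flops = 2 * num_requests * shared_tokens * head_dim

    # GEMV path: every request streams the shared keys from memory on its own.
    gemv_bytes = num_requests * shared_tokens * head_dim * bytes_per_elem

    # GEMM path: queries from all requests are stacked into one matrix,
    # so the shared keys are read once and reused across the whole batch.
    gemm_bytes = shared_tokens * head_dim * bytes_per_elem

    return flops / gemv_bytes, flops / gemm_bytes


if __name__ == "__main__":
    # 64 concurrent requests over a 1M-token shared context, head dimension 128.
    gemv, gemm = attention_score_intensity(64, 1_000_000, 128)
    print(f"GEMV: {gemv:.1f} FLOPs/byte, batched GEMM: {gemm:.1f} FLOPs/byte")
```

Under these assumptions the per-request GEMV path stays at a constant 2 FLOPs per byte no matter how many requests are served, while the batched GEMM path grows linearly with the number of concurrent requests. That growth is the sense in which batching moves shared-data attention from memory-bound toward compute-bound.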
