MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

TL;DR

MoSKA tackles the KV cache bandwidth bottleneck in long-context LLM inference by separating per-request unique KV data from massively shared KV data and transforming the shared-data attention into compute-bound GEMMs via Shared KV Attention. It combines MoE-inspired sparse routing, which selects the relevant shared chunks, with a disaggregated hardware infrastructure that scales shared and unique workloads independently. On an FP8 Llama 3.1 8B setup with $75\%$ sparsity and large shared contexts ($1\text{M}$ to $16\text{M}$ tokens) alongside $64\text{K}$ unique tokens, MoSKA achieves up to $538.7\times$ the throughput of baselines. The Universal MoSKA vision, supported by position-independent KV caching (EPIC2410), aims to enable dynamic, composable knowledge contexts from a modular KV library for scalable, context-rich AI systems.

Abstract

The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
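
The GEMV-to-GEMM transformation described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes a single attention head, one decode-step query per request, and dense (non-sparse) attention over one shared K/V block, and the function names `attention_per_request` and `attention_batched` are hypothetical. The point is only that stacking the concurrent queries lets the shared keys and values be read once for the whole batch instead of once per request, while producing the same outputs.

```python
import numpy as np

def attention_per_request(queries, K, V):
    """Baseline: each request runs its own GEMV-style pass over the shared K/V."""
    d = K.shape[1]
    outputs = []
    for q in queries:                        # q has shape (d,)
        scores = K @ q / np.sqrt(d)          # GEMV: (n, d) @ (d,) -> (n,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()             # softmax over the n shared tokens
        outputs.append(weights @ V)          # GEMV: (n,) @ (n, d) -> (d,)
    return np.stack(outputs)

def attention_batched(queries, K, V):
    """Batched: all requests share one GEMM against the same K/V."""
    d = K.shape[1]
    scores = queries @ K.T / np.sqrt(d)      # GEMM: (b, d) @ (d, n) -> (b, n)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                       # GEMM: (b, n) @ (n, d) -> (b, d)

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
b, n, d = 8, 1024, 64                        # requests, shared tokens, head dim
Q = rng.standard_normal((b, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

same = np.allclose(attention_per_request(Q, K, V), attention_batched(Q, K, V))
print("outputs match:", same)
```

In the per-request version, the arithmetic done per byte of K/V read stays constant as the batch grows; in the batched version it grows with the number of concurrent queries, which is what moves the operation from memory-bound toward compute-bound.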

Paper Structure

This paper contains 14 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Hardware Requirement Challenges. (a) shows that even with significant optimizations (GQA, sparse attention, quantization) applied at widely used levels, KV cache size still scales with sequence length and batch size. (b) illustrates that while sharing the KV cache solves the memory capacity scaling, memory bandwidth requirements still scale with the batch size. MoSKA's Shared KV Attention is designed to solve this remaining bandwidth scaling problem.
  • Figure 2: The MoSKA architecture, detailing its core mechanism and high-level structure. (a) illustrates the fundamental principle of Shared KV Attention: concurrent queries to identical shared data are batched into a single, compute-bound GEMM operation, in contrast to the memory-bound GEMV operations used for unique KV data. Building on this, (b) depicts the complete MoSKA system, where an MoE-inspired router first selects a sparse subset of relevant shared KV chunks ('Experts'), ensuring both computational efficiency and scalability over massive shared contexts.
  • Figure 3: Proposed disaggregated LLM serving infrastructure for MoSKA, separating FFN/Unique KV Attention Nodes from specialized Shared KV Attention Nodes.
  • Figure 4: Batch scaling capability and normalized throughput.
  • Figure 5: MFU and memory utilization of each node.
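
The capacity and bandwidth scaling described in the Figure 1 caption can be made concrete with a back-of-the-envelope sketch. Everything here is an assumption for illustration rather than a figure from the paper: the model shape (32 layers, 8 KV heads under GQA, head dimension 128) is taken to be a Llama 3.1 8B-style configuration, FP8 is taken as 1 byte per element, sparsity is ignored, and the helper names are hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed Llama 3.1 8B-style shape with FP8 (1 byte per element).
PER_TOKEN = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=1)

def capacity_unshared(batch, shared_tokens, unique_tokens):
    """Every request keeps its own copy of the shared context."""
    return batch * (shared_tokens + unique_tokens) * PER_TOKEN

def capacity_shared(batch, shared_tokens, unique_tokens):
    """One copy of the shared context, plus per-request unique KV."""
    return (shared_tokens + batch * unique_tokens) * PER_TOKEN

def shared_bytes_read_per_step(batch, shared_tokens, batched_gemm):
    """Shared KV bytes read per decode step (dense attention assumed).

    With per-request GEMVs the shared KV is read once per request;
    with a batched GEMM it is read once for the whole batch.
    """
    reads = 1 if batched_gemm else batch
    return reads * shared_tokens * PER_TOKEN

GiB = 2**30
shared, unique = 2**20, 2**16                # 1M shared tokens, 64K unique tokens
for batch in (1, 16, 256):
    print(
        batch,
        capacity_unshared(batch, shared, unique) / GiB,
        capacity_shared(batch, shared, unique) / GiB,
        shared_bytes_read_per_step(batch, shared, batched_gemm=False) / GiB,
        shared_bytes_read_per_step(batch, shared, batched_gemm=True) / GiB,
    )
```

Under these assumptions, sharing removes the batch factor from the capacity of the shared context, but the bytes read per step still grow with the batch unless the shared attention is batched into a single pass.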