The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving

Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan

TL;DR

This paper addresses the challenge of serving long-context LLMs at scale by introducing the CAP principle, which posits that any optimization can improve at most two of three goals: Context, Accuracy, and Performance. It provides a taxonomy that spans two system layers (model and agent) and maps existing techniques into six CAP types: C, A, P, CA, CP, and AP. The survey treats in detail model memory, positional embeddings, accuracy improvements, speedups via sparsity and linear attention, distributed acceleration, prompt compression, and agent memory. It highlights concrete methods, trade-offs, and practical metrics, emphasizing how user-perceived measurements influence what counts as meeting a CAP goal. By framing long-context serving through CAP, the survey offers a practical design guide for balancing memory, computation, and quality, and it points to a future where closer co-design of models and hardware may achieve all three goals with minimal compromises. In practice, the framework helps system designers select complementary techniques (e.g., memory + compression or memory + distributed acceleration) to meet user expectations for context length and accuracy within latency and cost constraints, and it guides research toward better alignment of metrics and evaluation in real-world deployments.

Abstract

We survey the large language model (LLM) serving area to understand the intricate dynamics between cost-efficiency and accuracy, which are magnified by the growing need for longer contextual understanding when deploying models at a massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving accuracy (A), and improving serving performance (P). Drawing inspiration from the CAP theorem in databases, we propose a CAP principle for LLM serving, which suggests that any optimization can improve at most two of these three goals simultaneously. Our survey categorizes existing works within this framework. We find that the definition and continuity of user-perceived measurement metrics are crucial in determining whether a goal has been met, akin to prior CAP databases in the wild. We recognize the CAP principle for LLM serving as a guiding principle, rather than a formal theorem, to inform designers of the inherent and dynamic trade-offs in serving models. As serving accuracy and performance have been extensively studied, this survey focuses on works that extend serving context length and address the resulting challenges.
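
The six CAP types named in the TL;DR (C, A, P, CA, CP, AP) are simply the non-empty subsets of the three goals that leave at least one goal out. The following is a minimal illustrative sketch of that counting argument, not code from the paper; the function names `cap_types` and `classify` are hypothetical and chosen here for illustration only.

```python
from itertools import combinations

# The three goals of the CAP principle: Context length, Accuracy, Performance.
GOALS = ("C", "A", "P")


def cap_types(goals=GOALS):
    """List every non-empty subset of goals that omits at least one goal."""
    return [
        "".join(combo)
        for size in range(1, len(goals))  # sizes 1 and 2, never all three
        for combo in combinations(goals, size)
    ]


def classify(improves):
    """Return the CAP type label for the set of goals a technique improves.

    Raises ValueError for unknown goals, for an empty set, and for a claim
    to improve all three goals, which the principle (a guideline, not a
    formal theorem) says should not be expected of a single optimization.
    """
    unknown = set(improves) - set(GOALS)
    if unknown:
        raise ValueError(f"unknown goals: {sorted(unknown)}")
    label = "".join(goal for goal in GOALS if goal in improves)
    if label not in cap_types():
        raise ValueError(f"not a CAP type: {sorted(improves)}")
    return label
```

For example, a hypothetical technique that extends context length and lowers latency at some cost in accuracy would be labeled with `classify({"C", "P"})`, while `classify({"C", "A", "P"})` raises an error.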


Paper Structure

This paper contains 17 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: The CAP principle for LLM Serving. C is improving context length, A is improving accuracy, and P is improving serving performance or cost-efficiency in general. It states that any serving optimization can improve at most two of the above three goals.
  • Figure 2: A modern-day LLM serving system commonly has two layers: a model layer, which runs a given LLM, and an agent layer, which runs LLM-based system applications. PE means Positional Embedding. Quant is short for quantization.
  • Figure 3: Works using sequence parallelism. Gray boxes are not tailored for long-context serving.
  • Figure 4: Efficient SP-attention mechanisms used in the prefilling phase of LLM serving.
  • Figure 5: Dist Attention (InfiniteLLM, LoongServe), an SP-attention mechanism optimized for the auto-regressive decode phase of LLM serving. In the decode phase, Q length is one and KV is already distributed.
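
The Figure 5 caption notes that in the decode phase the query has length one and the KV cache is already distributed. One generic way to compute exact attention in that setting, without gathering all keys and values in one place, is for each shard to return a partial result together with its softmax statistics, and then to merge the partials by rescaling to a shared maximum. The sketch below shows that merge in a single process using plain Python lists. It is an assumption-laden illustration, not the paper's method, and it may differ from how Dist Attention is implemented in the cited systems; all function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attend one query vector over one KV shard.

    Returns (local_max, local_sum, unnormalized_output) so that shards can
    be merged later without exchanging their keys or values.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    local_max = max(scores)
    weights = [math.exp(s - local_max) for s in scores]
    out = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return local_max, sum(weights), out


def merge_partials(partials):
    """Combine per-shard partials into the exact softmax-attention output."""
    global_max = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total = 0.0
    acc = [0.0] * dim
    for local_max, local_sum, out in partials:
        factor = math.exp(local_max - global_max)  # rescale to shared max
        total += factor * local_sum
        for d in range(dim):
            acc[d] += factor * out[d]
    return [a / total for a in acc]


def sharded_decode_attention(q, kv_shards):
    """kv_shards is a list of (keys, values) pairs, one pair per shard."""
    return merge_partials([partial_attention(q, k, v) for k, v in kv_shards])
```

Because only a scalar maximum, a scalar sum, and one output vector leave each shard, the data exchanged per decode step does not grow with the number of cached tokens, which is the property that makes this style of merge attractive for long contexts.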