AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference

Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, Bo Tang

TL;DR

AlayaDB tackles the inefficiency of long-context LLM inference by decoupling KV cache management and attention computation from the inference engine and embedding them into a native vector database. It introduces the Dynamic Inner Product Range Query (DIPR) and its processing algorithm, DIPRS, to adaptively select the number of critical keys/values per head and task, paired with a rule-based optimizer that chooses among top-$k$, DIPR, and filter queries. The system employs index sharing, GPU-accelerated kNN, late materialization, and an SPDK-based vector file system with a data-centric attention engine, achieving low Time-To-First-Token and competitive Time-Per-Output-Token while maintaining high generation quality. Empirical results across 8 long-context tasks show that DIPRS often outperforms traditional sparse-attention and full-attention baselines under realistic SLOs, with significant gains in throughput, memory efficiency, and end-to-end latency, demonstrating practical impact for MaaS and other long-context LLM applications.

Abstract

AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB AI. Specifically, it decouples the KV cache and attention computation from the LLM inference system and encapsulates them in a novel vector database system. For Model as a Service (MaaS) providers, AlayaDB consumes fewer hardware resources and offers higher generation quality for various workloads with different kinds of Service Level Objectives (SLOs), compared with existing alternative solutions (e.g., KV cache disaggregation, retrieval-based sparse attention). The crux of AlayaDB is that it abstracts the attention computation and cache management for LLM inference into a query processing procedure, and optimizes performance via a native query optimizer. In this work, we demonstrate the effectiveness of AlayaDB via (i) three use cases from our industry partners, and (ii) extensive experimental results on LLM inference benchmarks.

Paper Structure

This paper contains 25 sections, 1 theorem, 1 equation, 10 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

The critical token in Definition 1 is equivalent to the inner product-based critical token in Definition 2.

Figures (10)

  • Figure 1: Summary of LLM inference solutions
  • Figure 2: System overview of AlayaDB
  • Figure 3: Using AlayaDB APIs for LLM inference
  • Figure 4: The number of selected tokens in different heads
  • Figure 5: The number of critical tokens in different tasks
  • ...and 5 more figures

Theorems & Definitions (4)

  • Definition 1: Critical token
  • Definition 2: Inner Product-based Critical token
  • Theorem 1
  • Definition 3: Dynamic Inner-Product Range Query, DIPR($\bm{q},\beta$)
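The definitions themselves are not reproduced in this summary, so the following is only a minimal sketch under stated assumptions: that a token is "critical" when its softmax attention weight is at least a fraction $\alpha$ of the largest weight, and that DIPR($\bm{q},\beta$) returns every key whose inner product with $\bm{q}$ is within $\beta$ of the maximum inner product. Under those assumptions the two criteria coincide when $\beta = -\sqrt{d}\,\ln\alpha$, which is the kind of equivalence Theorem 1 states. The function names `dipr` and `critical_by_attention` are hypothetical and are not AlayaDB APIs.

```python
import math

def inner(q, k):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, k))

def dipr(q, keys, beta):
    """Assumed DIPR semantics: indices of keys with q.k >= max(q.k) - beta."""
    scores = [inner(q, k) for k in keys]
    threshold = max(scores) - beta
    return [i for i, s in enumerate(scores) if s >= threshold]

def critical_by_attention(q, keys, alpha):
    """Assumed critical-token rule: softmax weight >= alpha * largest weight."""
    scale = math.sqrt(len(q))
    logits = [inner(q, k) / scale for k in keys]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    top = max(weights)
    return [i for i, w in enumerate(weights) if w >= alpha * top]

# Toy data: four 2-dimensional keys and one query.
q = [1.0, 0.0]
keys = [[5.0, 0.0], [4.0, 0.0], [1.0, 0.0], [4.5, 3.0]]

beta = 1.5
alpha = math.exp(-beta / math.sqrt(len(q)))  # alpha matching this beta

selected = dipr(q, keys, beta)
by_attention = critical_by_attention(q, keys, alpha)
print(selected, by_attention)  # [0, 1, 3] [0, 1, 3]
```

Unlike a top-$k$ query, the number of returned keys is not fixed in advance: it depends on how the inner products are distributed, which matches the TL;DR's description of selecting a different number of critical keys per head and task. This brute-force scan is for illustration only; the paper's DIPRS algorithm is index-based and is not shown here.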