Pancake: Hierarchical Memory System for Multi-Agent LLM Serving

Zhengding Hu, Zaifeng Pan, Prabhleen Kaur, Vibha Murthy, Zhongkai Yu, Yue Guan, Zhen Wang, Steven Swanson, Yufei Ding

TL;DR

Pancake is a multi-tier agentic memory system that unifies three key techniques: multi-level index caching for single agents, coordinated index management across multiple agents, and collaborative GPU-CPU acceleration.

Abstract

In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) search problems. We present Pancake, a multi-tier agentic memory system that unifies three key techniques: (i) multi-level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU-CPU acceleration. Pancake exposes an easy-to-use interface that can be integrated into memory-based agents like Mem-GPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than 4.29x end-to-end throughput improvement.

Paper Structure (20 sections, 3 equations, 19 figures)

Figures (19)

  • Figure 1: Memory-based workflow of agentic LLMs.
  • Figure 2: An example of multi-agent memory in Pancake.
  • Figure 3: Memory-based agents and their workflows.
  • Figure 4: Direct in-place updates scatter the new vectors into a large number of existing clusters, leading to degradation in efficiency and recall. A naive solution is to leverage intra-agent locality and maintain dedicated clusters for an agent.
  • Figure 5: For more complex workloads, locality across multiple reasoning steps of different requests can be observed. This makes naive dedicated clusters for the agent inefficient, as they fail to capture step-wise clustering.
  • ...and 14 more figures
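
The contrast drawn in the Figure 4 caption can be illustrated with a toy sketch. This is not Pancake's implementation: the function names (`in_place_insert`, `dedicated_insert`), the randomly generated centroids and vectors, and the sizes used are all assumptions made for illustration. The sketch only counts how many clusters a batch of new vectors touches under each policy.

```python
import random


def nearest(centroids, v):
    """Return the index of the centroid closest to v (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(centroids[i], v)),
    )


def in_place_insert(centroids, clusters, vectors):
    """Hypothetical in-place update: route each new vector to its nearest
    existing cluster and return the set of cluster ids that were modified."""
    touched = set()
    for v in vectors:
        cid = nearest(centroids, v)
        clusters[cid].append(v)
        touched.add(cid)
    return touched


def dedicated_insert(agent_clusters, agent_id, vectors):
    """Hypothetical naive alternative: keep all of an agent's new vectors
    in one cluster reserved for that agent."""
    agent_clusters.setdefault(agent_id, []).extend(vectors)
    return {agent_id}


# Synthetic data (assumed sizes): 64 existing clusters, 200 new vectors, 8 dimensions.
rng = random.Random(0)
dim, n_clusters, n_new = 8, 64, 200
centroids = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_clusters)]
clusters = {i: [] for i in range(n_clusters)}
new_vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_new)]

touched_in_place = in_place_insert(centroids, clusters, new_vectors)

agent_clusters = {}
touched_dedicated = dedicated_insert(agent_clusters, "agent-0", new_vectors)

print("clusters touched by in-place updates:", len(touched_in_place))
print("clusters touched with a dedicated cluster:", len(touched_dedicated))
```

With uniformly random vectors the in-place policy modifies many clusters, while the dedicated policy modifies one. As the Figure 5 caption notes, a single per-agent cluster is itself too coarse once locality appears across reasoning steps of different requests.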