KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

Zebin Yang, Tong Xie, Baotong Lu, Shaoshan Liu, Bo Yu, Meng Li

TL;DR

KEEP is a KV-cache-centric memory management system for efficient embodied planning. It combines three components: a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation by organizing memory into mixed-granularity groups; a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among memory groups and iteratively reconstructs their interactions; and a Layer-balanced Memory Loading scheme that eliminates imbalance in KV cache loading and cross-attention computation across layers.

Abstract

Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex, long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view and thereby avoid repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. Although the KV caches can be stored and reused, frequent KV cache updates greatly undermine the efficiency benefits. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features three key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation by organizing memory into mixed-granularity groups; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among memory groups and iteratively reconstructs their interactions; (3) a Layer-balanced Memory Loading scheme that eliminates imbalance in KV cache loading and cross-attention computation across layers. Experiments on the ALFRED dataset show that KEEP achieves a 2.68x speedup with negligible accuracy loss compared with text-based memory methods. Compared with the KV re-computation method CacheBlend (EuroSys'25), KEEP improves the success rate by 4.13% and reduces time-to-first-token (TTFT) by 1.90x. Our code is available at https://github.com/PKU-SEC-Lab/KEEP_Embodied_Memory.
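To make the motivation for mixed-granularity memory groups concrete, the following toy sketch counts how many tokens would need their KV cache rebuilt under two grouping strategies. It is an illustration only, not KEEP's algorithm: the `Memory` class, the `tokens_to_recompute` helper, the example entries, the token counts, and the rule that a single update invalidates the whole group are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    name: str
    tokens: int    # prompt length of this memory entry (made-up numbers)
    dynamic: bool  # assumed label: does this entry tend to change often?
    updated: bool  # did the entry change since the last planning step?


def tokens_to_recompute(groups: list[list[Memory]]) -> int:
    """Count tokens whose KV cache must be rebuilt.

    Toy rule (an assumption): a group's cached KV is reusable only if no
    member changed; otherwise the whole group is prefilled again.
    """
    return sum(
        sum(m.tokens for m in group)
        for group in groups
        if any(m.updated for m in group)
    )


memories = [
    Memory("room layout", 120, dynamic=False, updated=False),
    Memory("furniture positions", 80, dynamic=False, updated=False),
    Memory("robot location", 20, dynamic=True, updated=True),
    Memory("held object", 15, dynamic=True, updated=False),
]

# One coarse group: a single update invalidates every cached entry.
coarse = [memories]

# Mixed granularity: static entries share one large group, while each
# dynamic entry gets its own small group.
static_group = [m for m in memories if not m.dynamic]
dynamic_groups = [[m] for m in memories if m.dynamic]
mixed = [static_group] + dynamic_groups

print(tokens_to_recompute(coarse))  # 235
print(tokens_to_recompute(mixed))   # 20
```

In this toy setting, isolating the frequently updated entries means one change to the robot's location triggers recomputation of 20 tokens instead of all 235. The real system must additionally restore cross-attention between groups, which this sketch ignores.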

Paper Structure (14 sections, 10 figures, 3 tables)

This paper contains 14 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: (a) An example of embodied planning with an LLM planner. At each step, the planner receives a prompt composed of retrieved memory and the instruction. As the number of retrieved memory segments increases, both the success rate and the prefill latency increase, as evaluated on the ALFRED dataset with (b) Qwen-14B and (c) Qwen-32B (INT4).
  • Figure 2: Comparison with previous KV reuse methods on memory construction.
  • Figure 3: Impact of different block sizes on KV recomputation methods, evaluated on ALFRED with the CacheBlend method using (a) Qwen-14B and (b) Qwen-32B (INT4).
  • Figure 4: Method comparison on KV recomputation.
  • Figure 5: Different memories show different update frequencies. Here we use a coarse-grained memory classification as an example.
  • ...and 5 more figures