
Adaptive Contextual Caching for Mobile Edge Large Language Model Service

Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong

TL;DR

An Adaptive Contextual Caching framework that anticipates user needs by proactively caching semantically relevant data for mobile-edge LLMs, enabling scalable, low-latency LLM services in resource-constrained edge environments.

Abstract

Mobile edge Large Language Model (LLM) deployments face inherent constraints, such as limited computational resources and network bandwidth. Although Retrieval-Augmented Generation (RAG) mitigates some challenges by integrating external knowledge bases, inefficient cache management can still result in high retrieval latency and frequent cache updates. To address these issues, we propose an Adaptive Contextual Caching (ACC) framework that anticipates user needs by proactively caching semantically relevant data for mobile-edge LLMs. ACC utilizes a deep reinforcement learning (DRL) module to refine cache replacement policies, balancing user context, document similarity, and the overhead associated with cache misses. Experimental results demonstrate that ACC increases cache hit rates to over 80% after only 11 training episodes, outperforming FIFO, LRU, and semantic-only caching while reducing retrieval latency by up to 40%. In particular, ACC also reduces local caching overhead (i.e., the cost of updating the cache when a miss occurs) by as much as 55%, enabling scalable, low-latency LLM services in resource-constrained edge environments.
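
To make the baseline comparison concrete, the following minimal sketch measures cache hit rate for the FIFO and LRU replacement policies named in the abstract. It is not the paper's evaluation code: the request trace, the capacity, and the function names `hit_rate_fifo` and `hit_rate_lru` are assumptions chosen for illustration, and chunks are matched by exact key rather than by semantic similarity.

```python
from collections import OrderedDict, deque


def hit_rate_fifo(requests, capacity):
    """Fraction of requests served from a FIFO cache (evicts the oldest insertion)."""
    cache, order, hits = set(), deque(), 0
    for key in requests:
        if key in cache:
            hits += 1          # a hit does not change FIFO eviction order
            continue
        if len(cache) >= capacity:
            cache.discard(order.popleft())
        cache.add(key)
        order.append(key)
    return hits / len(requests)


def hit_rate_lru(requests, capacity):
    """Fraction of requests served from an LRU cache (evicts the least recently used)."""
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)     # refresh recency on a hit
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used key
        cache[key] = True
    return hits / len(requests)


# Hypothetical trace in which chunk "a" is requested repeatedly.
trace = ["a", "b", "a", "c", "a", "d", "a", "b"]
fifo_rate = hit_rate_fifo(trace, capacity=2)
lru_rate = hit_rate_lru(trace, capacity=2)
print(fifo_rate, lru_rate)
```

On this toy trace LRU keeps the frequently requested chunk resident while FIFO evicts it, which is the kind of gap a learned replacement policy aims to widen further.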

Paper Structure (24 sections, 5 figures)

Figures (5)

  • Figure 1: Comparison of Standard RAG and Contextual RAG workflows. Standard RAG retrieves general-purpose knowledge from knowledge bases to respond to queries, while Contextual RAG augments retrieval with real-time and localized data for enriched responses. The example demonstrates the added value of Contextual RAG in providing actionable insights for a weather-related driving query.
  • Figure 2: Illustration of key components in a caching system for RAG. The knowledge base is divided into chunks, vectorized into high-dimensional embeddings via embedding models, and indexed (e.g., using HNSW) for efficient similarity search. Frequently accessed or computationally expensive embeddings are stored in the vector cache, which dynamically replaces expired or non-relevant vectors based on replacement policies. Query embeddings are compared with indexed vectors, and the most relevant contexts are retrieved and combined with the query to form a final prompt for the LLM.
  • Figure 3: Illustration of the proposed ACC in a mobile-edge LLM scenario, contrasting a conventional retrieval-only flow with a proactive caching approach. Upon receiving a user prompt (①), the Edge LLM checks the cache server for relevant knowledge. If the cache misses (②), the system retrieves the needed content (e.g., $T_1$) from the knowledge base (③). The DRL agent selectively updates the cache by deciding whether to store or replace entries (④). Once updated, subsequent queries can yield a cache hit (⑤) without additional retrieval overhead. In the example shown, the mobile unit requests speed guidelines for Maple Avenue and relevant chunks (e.g., $T_2, T_3$) are proactively cached, reducing future latency and overhead.
  • Figure 4: (a) Cache hit rate trends over multiple episodes for different caching strategies. (b) Comparison of the average retrieval latency in seconds. The proposed ACC method achieves higher hit rates and lower latency than baseline approaches.
  • Figure 5: Average caching consumption under varying cache sizes. ACC maintains lower consumption compared to FIFO, LRU, and semantic caching at multiple cache capacities.
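
The lookup, miss, and update flow described in the captions for Figures 2 and 3 can be sketched as a small similarity-thresholded vector cache. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `VectorCache` class, the `retrieve` helper, the 0.9 cosine threshold, and the two-dimensional toy embeddings are all hypothetical, and a plain LRU eviction rule stands in for the DRL agent that decides what to store or replace in ACC.

```python
import math
from collections import OrderedDict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class VectorCache:
    """Hypothetical vector cache; LRU eviction is a stand-in for the DRL policy."""

    def __init__(self, capacity, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold      # assumed similarity cut-off for a hit
        self.entries = OrderedDict()    # chunk_id -> (embedding, text)
        self.hits = 0
        self.misses = 0

    def lookup(self, query_vec):
        """Return the cached text most similar to the query, or None on a miss."""
        best_id, best_sim = None, -1.0
        for chunk_id, (vec, _) in self.entries.items():
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_id, best_sim = chunk_id, sim
        if best_id is not None and best_sim >= self.threshold:
            self.entries.move_to_end(best_id)   # refresh recency
            self.hits += 1
            return self.entries[best_id][1]
        self.misses += 1
        return None

    def admit(self, chunk_id, vec, text):
        """Store a chunk, evicting the least recently used entry when full."""
        if chunk_id not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[chunk_id] = (vec, text)
        self.entries.move_to_end(chunk_id)


def retrieve(cache, knowledge_base, query_vec):
    """Check the cache first; on a miss, fetch from the knowledge base and update the cache."""
    cached = cache.lookup(query_vec)
    if cached is not None:
        return cached
    chunk_id, (vec, text) = max(
        knowledge_base.items(), key=lambda kv: cosine(query_vec, kv[1][0])
    )
    cache.admit(chunk_id, vec, text)
    return text


# Toy knowledge base with made-up embeddings and placeholder chunk text.
kb = {"T1": ([1.0, 0.0], "chunk T1"), "T2": ([0.0, 1.0], "chunk T2")}
cache = VectorCache(capacity=1)
first = retrieve(cache, kb, [0.9, 0.1])    # cache is empty: miss, then T1 is admitted
second = retrieve(cache, kb, [1.0, 0.05])  # close to T1: served from the cache
```

In a real deployment the linear scan over cached embeddings would typically be replaced by an approximate nearest-neighbor index such as HNSW, and the eviction decision in `admit` is where a learned policy would plug in.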