DisCEdge: Distributed Context Management for Large Language Models at the Edge

Mohammadreza Malekabbasi, Minghe Wang, David Bermbach

TL;DR

The paper tackles the challenge of maintaining consistent user context for geo-distributed LLM inference at the edge. It proposes DisCEdge, a system that stores and replicates pre-tokenized user context across edge nodes, so that prompts can be built without re-tokenizing the conversation history and less data has to move between nodes. In experiments with a prototype, DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% relative to a raw-text-based system, and it cuts client request sizes by a median of 90% relative to client-side context management, all while preserving data consistency. The approach benefits latency-sensitive, privacy-conscious edge deployments and points toward scalable, token-based context replication in distributed AI systems.

Abstract

Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLMs makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, often introduce network latency and bandwidth overhead, undermining the advantages of edge deployment. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences rather than raw text, our system avoids redundant computation and enables efficient data replication. We implement and evaluate an open-source prototype in a realistic edge environment with commodity hardware. We show DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency.
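To make the "avoids redundant computation" claim concrete, the following is a minimal sketch, not the DisCEdge implementation. It contrasts a store that keeps each session as token IDs with one that keeps raw text and re-tokenizes the full history on every turn. The names `toy_tokenize`, `TokenContextStore`, `RawTextContextStore`, and `compare` are hypothetical, and the whitespace tokenizer is only a stand-in for a real model tokenizer.

```python
from dataclasses import dataclass, field


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Hypothetical stand-in tokenizer: one ID per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


@dataclass
class TokenContextStore:
    """Keeps each session's context as token IDs; only new text is tokenized."""
    vocab: dict[str, int] = field(default_factory=dict)
    sessions: dict[str, list[int]] = field(default_factory=dict)
    tokens_processed: int = 0  # tokenizer work done so far

    def append_turn(self, session_id: str, text: str) -> list[int]:
        new_ids = toy_tokenize(text, self.vocab)
        self.tokens_processed += len(new_ids)
        self.sessions.setdefault(session_id, []).extend(new_ids)
        return self.sessions[session_id]


@dataclass
class RawTextContextStore:
    """Keeps each session's context as text; the whole history is re-tokenized."""
    vocab: dict[str, int] = field(default_factory=dict)
    sessions: dict[str, str] = field(default_factory=dict)
    tokens_processed: int = 0  # tokenizer work done so far

    def append_turn(self, session_id: str, text: str) -> list[int]:
        history = (self.sessions.get(session_id, "") + " " + text).strip()
        self.sessions[session_id] = history
        ids = toy_tokenize(history, self.vocab)
        self.tokens_processed += len(ids)
        return ids


def compare(turns: list[str]) -> tuple[int, int]:
    """Return tokenizer work (tokenized store, raw-text store) for one session."""
    token_store, raw_store = TokenContextStore(), RawTextContextStore()
    for turn in turns:
        token_store.append_turn("s1", turn)
        raw_store.append_turn("s1", turn)
    return token_store.tokens_processed, raw_store.tokens_processed
```

In this toy setting, tokenizer work for the tokenized store grows with the length of each new turn, while for the raw-text store it grows with the length of the whole conversation. The sketch says nothing about how DisCEdge replicates context between nodes or enforces consistency.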

Paper Structure

This paper contains 25 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: DisCEdge Architecture Overview
  • Figure 2: The "LLM Service" as an inference framework, and its abstract integration with the Context Manager.
  • Figure 3: Client-observable response time per turn for tokenized versus raw text context storage on M2 and TX2 nodes. Error bars represent the 95% confidence interval.
  • Figure 4: Tokens generated per second (TPS) for tokenized versus raw text context storage. The tokenized approach shows a modest performance improvement, which is more pronounced on the resource-constrained TX2 node.
  • Figure 5: Network overhead for synchronizing context data between edge nodes, comparing tokenized versus raw text storage. Storing context as tokens reduces network usage compared to raw text. The network packets were collected on the M2 node.
  • ...and 2 more figures