Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill

Seunghun Lee, Jihong Park, Ce Zheng, Hyuncheol Park

Abstract

Edge deployment of large language models (LLMs) can reduce latency for interactive services, but mobility introduces service interruptions when a user equipment (UE) hands over between base stations (BSs). To promptly resume decoding, the target-side edge server must recover the UE context state, which can be provisioned either by token forwarding followed by prefill computation or by direct key-value (KV) cache transmission over the backhaul. This paper proposes a unified handover (HO) design that jointly selects the prefill length and schedules backhaul KV cache delivery to minimize the worst-user LLM HO delay for multiple UEs. The resulting scheme admits a tractable step-wise solution with explicit feasibility conditions and a constructive rate-scheduling policy. Simulations show that the proposed method consistently outperforms baselines across a wide range of backhaul capacities, prefill speeds, and context sizes, providing practical guidelines for mobility-aware Edge LLM token streaming.

Paper Structure

This paper contains 12 sections, 3 theorems, 25 equations, and 4 figures.

Key Result

Proposition 1

The pair $(L^{\star},\, r^{\star}(\cdot))$ is a global optimum of ${\mathscr P}$.
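
The problem $\mathscr{P}$ is not restated in this summary, but the Figure 2 caption indicates that the optimal batch prefill length $L^{\star}$ equalizes the prefill and cache transfer delays. As a minimal sketch, under illustrative linear delay models of our own devising (a prefill speed of $v_{\mathrm{pf}}$ tokens/s, a per-token KV cache of $s_{\mathrm{KV}}$ bits, a context of $N$ tokens, and a backhaul rate $R_{\mathrm{bh}}$; these are assumptions, not the paper's formulas),

\[ D^{(\mathrm{pf})}(L) = \frac{L}{v_{\mathrm{pf}}}, \qquad D_{(\mathrm{tx})}^{\star}(L) = \frac{(N - L)\, s_{\mathrm{KV}}}{R_{\mathrm{bh}}}, \]

and solving $D^{(\mathrm{pf})}(L^{\star}) = D_{(\mathrm{tx})}^{\star}(L^{\star})$ yields

\[ L^{\star} = \frac{N\, s_{\mathrm{KV}}\, v_{\mathrm{pf}}}{R_{\mathrm{bh}} + s_{\mathrm{KV}}\, v_{\mathrm{pf}}}. \]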

Figures (4)

  • Figure 1: An illustration of the proposed ctHO, which jointly exploits batch prefill and KV cache transfer to minimize the worst-user HO delay.
  • Figure 2: An illustration of the proposed design principle. The HO prefill delay $D^{(\mathrm{pf})}(L)$ and the minimum cache transfer delay $D_{(\mathrm{tx})}^{\star}(L)$ are equalized by the choice of the batch prefill length $L$, so that neither process dominates the overall LLM HO delay (a numerical sketch of this principle follows the figure list).
  • Figure 3: Worst-user LLM HO delay comparison of ctHO, tHO, and cHO under different system parameters.
  • Figure 4: Worst-user total streaming delay vs. distance between BSs with $R_{\mathrm{bh}}=4.5$ Gbps. The total streaming delay includes both the LLM HO delay and the subsequent streaming delay to deliver the $G$ generated tokens after the LLM HO.
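
As a rough numerical illustration of the equalization principle behind Figure 2, the sketch below bisects over the batch prefill length $L$ until neither prefill nor KV cache transfer dominates the HO delay. All delay models and constants (tokens_per_sec, bytes_per_token, backhaul_bps, N) are hypothetical placeholders, not values or formulas taken from the paper:

    # Illustrative sketch of the delay-equalization principle in Fig. 2.
    # Every model and constant here is a hypothetical placeholder, not
    # the paper's formulation: prefill delay grows with the prefill
    # length L, while the KV cache of the remaining N - L tokens must
    # cross the backhaul, so transfer delay shrinks with L.

    def prefill_delay(L, tokens_per_sec=2000.0):
        """Hypothetical linear batch-prefill delay for L tokens."""
        return L / tokens_per_sec

    def kv_transfer_delay(L, N=8192, bytes_per_token=160e3, backhaul_bps=4.5e9):
        """Hypothetical backhaul delay for the KV cache of the other N - L tokens."""
        return (N - L) * bytes_per_token * 8 / backhaul_bps

    def equalize_prefill_length(N=8192):
        """Bisect for the L at which neither prefill nor KV transfer
        dominates the HO delay, i.e. D_pf(L) ~ D_tx(L) (Fig. 2)."""
        lo, hi = 0, N
        while lo < hi:
            mid = (lo + hi) // 2
            if prefill_delay(mid) < kv_transfer_delay(mid, N=N):
                lo = mid + 1  # prefill is the faster side: prefill more tokens
            else:
                hi = mid      # transfer is the faster side: prefill fewer tokens
        return lo

    L_star = equalize_prefill_length()
    ho_delay = max(prefill_delay(L_star), kv_transfer_delay(L_star))
    print(f"L* = {L_star}, resulting HO delay ~ {ho_delay * 1e3:.1f} ms")

Under any delay models that are monotone in $L$ (prefill increasing, transfer decreasing), the bisection terminates at the crossing point, mirroring the closed-form $L^{\star}$ sketched after Proposition 1.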

Theorems & Definitions (6)

  • Proposition 1
  • Proof of Proposition 1
  • Proposition 2
  • Proof of Proposition 2
  • Proposition 3
  • Proof of Proposition 3