GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing

Alessio Ricci Toniolo; Abinaya Dinesh; Rome Thorstenson

TL;DR

GORGO targets time-to-first-token (TTFT) in geo-distributed LLM inference by formulating per-request routing as a joint optimization over KV-cache locality, inter-region latency, and admission/queue state. It introduces a cost model and a distributed per-region policy that preserves local capacity while selectively forwarding requests across regions, guided by measured RTT and prefix-overlap signals; a centralized proxy variant further improves performance. Experiments across three regions show substantial TTFT reductions (notably a roughly 2.5x improvement in median TTFT for the centralized proxy variant) and demonstrate the value of incorporating network latency into cache-aware routing. The work provides a practical framework for TTFT-sensitive, geo-distributed LLM serving and highlights the trade-offs between centralized coordination and fully distributed control.

Abstract

Distributing LLM inference across geographical regions can improve Time-to-First-Token (TTFT) by regionalizing service deployments. While existing multi-region load balancers save prefill computation by prioritizing Key-Value (KV) cache hit rate, they ignore cluster networking latency, a critical factor in routing decisions. We introduce GORGO, a method for minimizing TTFT by optimizing a total serving cost as a function of available compute, network latency, and prefix caching. Using extensive profiling on custom infrastructure, we analyze component-level latency bottlenecks and benchmark GORGO against three baselines: (1) naive least-load routing, which ignores prefix-cache overlap; (2) prefix-similarity routing, which selectively pushes requests to the replica with the highest cached-prefix overlap; and (3) a centralized HTTP proxy that runs the GORGO policy while tracking requests across all nodes. We demonstrate that GORGO reduces P99 TTFT through network-aware routing and improves average TTFT by preventing pathological cross-region forwarding. Additionally, we find that GORGO-proxy overcomes synchronization overhead in previous methods and is 2.5x faster on median TTFT, demonstrating the success of a centralized router.
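
To make the network-cache tradeoff described in the abstract concrete, the following is a minimal sketch of a routing cost of this general shape. It is not the paper's cost model: the additive form, the `Candidate` fields, the function names, and all numeric values are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-region snapshot; field names are illustrative only."""
    name: str
    rtt_ms: float              # measured round-trip time from the deciding region
    queue_ms: float            # estimated wait before prefill can start
    cached_prefix_tokens: int  # prompt tokens already present in that region's KV cache


def estimated_ttft_ms(c: Candidate, prompt_tokens: int, prefill_ms_per_token: float) -> float:
    """Assumed additive cost: network + queueing + prefill of the uncached suffix."""
    uncached = max(prompt_tokens - c.cached_prefix_tokens, 0)
    return c.rtt_ms + c.queue_ms + uncached * prefill_ms_per_token


def pick_region(candidates, prompt_tokens, prefill_ms_per_token):
    """Return the candidate with the lowest estimated TTFT."""
    return min(
        candidates,
        key=lambda c: estimated_ttft_ms(c, prompt_tokens, prefill_ms_per_token),
    )


# Made-up numbers: a cold local region and a warm but distant remote region.
local = Candidate("local", rtt_ms=0.0, queue_ms=20.0, cached_prefix_tokens=0)
remote = Candidate("remote", rtt_ms=150.0, queue_ms=5.0, cached_prefix_tokens=900)

# Long prompt: the remote cache hit outweighs the round trip.
long_choice = pick_region([local, remote], prompt_tokens=1000, prefill_ms_per_token=0.5)
# Short prompt: the round trip dominates, so the request stays local.
short_choice = pick_region([local, remote], prompt_tokens=100, prefill_ms_per_token=0.5)
```

Under these assumed numbers, a router that looks only at prefix overlap would forward both prompts to the remote region, whereas the latency-aware cost forwards only the long one. This is the kind of pathological cross-region forwarding the abstract says network-aware routing avoids.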

Paper Structure (54 sections, 2 equations, 4 figures, 5 tables)

Introduction
Background and Motivation
Serving Primitives: Continuous Batching and KV Caching
Prefix-Aware Routing and Its Coordination Costs
Geo-Distributed Inference and Cross-Region Routing
Routing Objectives and the Network-Cache Tradeoff
Routing and the GORGO Policy
Initial Geo-proximal Routing
Design objective and signal requirements
System architecture and routing workflow
Local control state.
Peer summaries.
Per-request decision.
Per-request objective: cost model
Translating prefix overlap into time.
...and 39 more sections

Figures (4)

Figure 1: Overview of the load balancer control flow per request.
Figure 2: Median latency and throughput metrics across methods.
Figure 3: Mean latency and throughput metrics across methods.
Figure 4: Perfetto Trace of GPU-2 During Prefix Trie Request Forwarding.
