WANSpec: Leveraging Global Compute Capacity for LLM Inference

Noah Martin, Fahad Dogar

TL;DR

WANSpec addresses the problem of uneven global load and tail latency in LLM inference by offloading part of speculative decoding to underutilized WAN-connected compute. It introduces a controller–worker architecture that uses entropy-based predictions to add redundant draft decoding only when needed, preserving latency while reducing the draft-model workload. The authors validate the approach through a measurement study across AWS regions, a flexible simulator, and cloud deployments, showing substantial reductions in draft-token work and robust latency behavior under realistic conditions. This work demonstrates a practical path to exploiting global compute capacity to relieve data-center pressure and reduce costs without sacrificing interactive response times.

Abstract

Data centers capable of running large language models (LLMs) are spread across the globe. Some have high-end GPUs for running the most advanced models (100B+ parameters), and others are only suitable for smaller models (1B parameters). The most capable GPUs are in high demand thanks to the rapidly expanding applications of LLMs. Because of this demand, the choice of location for an LLM inference workload can affect request latency. In this work, we explore options to shift some aspects of inference to under-utilized data centers. We first observe the varying delays affecting inference in AWS services from different regions, demonstrating that load is not spread evenly. We then introduce WANSpec, which offloads part of LLM generation to the under-utilized data centers. In doing so, WANSpec can mitigate capacity issues and make effective use of on-site compute (e.g., at universities) to augment cloud providers. It does this with speculative decoding, a widely used technique for speeding up auto-regressive decoding, by moving the draft model to the under-utilized compute resources. Our experiments in simulation and cloud deployments show that WANSpec can judiciously employ redundancy to avoid increases in latency while still reducing the forward passes of speculative decoding's draft model in high-demand data centers by over 50%.

Paper Structure (22 sections, 9 figures, 2 algorithms)

Figures (9)

  • Figure 1: Locations of AWS data centers. Stars mark the three regions supporting Claude Opus 4.1 in AWS Bedrock. Triangles mark the more prevalent, but less powerful, Claude Haiku.
  • Figure 2: Median (p50) and tail (p95) time to first token (TTFT) of Claude 3 Haiku between AWS regions. Each cell shows p50/p95 latency (in seconds) for a request originating from the source region (rows) to a target Bedrock region (columns). The data were collected over 3 days. In the p50 heatmap, intra-region latencies are lowest, as expected. In the p95 heatmap, some regions (particularly eu-west-2, us-east-1, and us-west-2) exhibit the lowest latencies for inter-region requests. This suggests that inference queuing dominates network latency in these cases.
  • Figure 3: Results of AWS Bedrock measurements over three days, for the same source and target region. Some regions (eu-west-2 in this graph) exhibit diurnal patterns, while others (us-west-2) do not. This pattern held in our original experiment in 2025 as well as the 2026 repeat.
  • Figure 4: Results of our experiments with requests originating in eu-west-2 and targeting eu-west-2 and ap-south-1. The p50 and p90 panels plot the median and p90 TTFTs for each hour. The median intra-region requests exhibit diurnal patterns. At p90, the intra-region TTFT shows significant instability, while the inter-region requests are more stable and at times have lower latency. To help rule out network latency as the bottleneck, the network-latency panel plots the distribution of TCP connect times we measured during our 2026 experiment, with the top 1% of outliers excluded for readability. Intra-region requests have low latency. Requests across the WAN have higher, but still stable, latency. Stable network latency combined with unstable TTFT indicates that increased queuing causes the slowdown.
  • Figure 5: Comparison of using speculative generation sequentially vs. in parallel. Overlapping draft and target model forward passes creates slack which can mask network latency. This parallelization requires one extra forward pass of the draft model per token generated by the target model.
  • ...and 4 more figures
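
The slack described in the Figure 5 caption can be illustrated with a toy per-round timing model. This is a minimal sketch and not the authors' simulator. The function names, the parameters (`k` draft tokens per round, per-pass times in milliseconds, `net_ms` of added network delay per round), and the assumption that every round looks the same are all hypothetical. It also does not model the extra draft forward passes that parallelization requires.

```python
def sequential_round_ms(k, t_draft_ms, t_target_ms, net_ms=0):
    """Sequential: draft k tokens, send them, then run one target pass."""
    return k * t_draft_ms + net_ms + t_target_ms


def parallel_round_ms(k, t_draft_ms, t_target_ms, net_ms=0):
    """Parallel: drafting (plus network delay) overlaps the target pass,
    so the round lasts as long as the slower of the two paths."""
    return max(t_target_ms, k * t_draft_ms + net_ms)


def slack_ms(k, t_draft_ms, t_target_ms):
    """Time left in a target pass after k overlapped draft passes.
    Network delay up to this amount is hidden in the parallel model."""
    return t_target_ms - k * t_draft_ms


if __name__ == "__main__":
    # Hypothetical numbers: 4 draft tokens at 5 ms each, 60 ms target pass.
    k, t_draft, t_target = 4, 5, 60
    print("slack (ms):", slack_ms(k, t_draft, t_target))
    for net in (0, 30, 80):
        seq = sequential_round_ms(k, t_draft, t_target, net)
        par = parallel_round_ms(k, t_draft, t_target, net)
        print(f"net={net} ms  sequential={seq} ms  parallel={par} ms")
```

Under these assumed numbers, a network delay below the slack leaves the parallel round time equal to the target pass time, while a larger delay starts to lengthen the round. The sequential round always pays the full delay.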