Token Management in Multi-Tenant AI Inference Platforms

William J. Cunningham

TL;DR

This paper introduces token pools, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Token pools support priority-aware allocation, service tiers with differentiated guarantees, and debt-based fairness mechanisms, all without modifying the underlying inference runtime or cluster scheduler.

Abstract

Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. Conventional approaches fail to achieve this balance: dedicated endpoints strand capacity on idle models, while rate limits ignore the heterogeneous cost of inference requests. We introduce token pools, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Unlike rate limits, which govern request admission without regard to execution cost, token pools authorize both admission and autoscaling from the same capacity model, ensuring consistency between what is promised and what is provisioned. The abstraction captures burst modes across multiple dimensions invisible to conventional throttling. Dynamic per-entitlement limits on each burst dimension enable fine-grained control over resource consumption while permitting work-conserving backfill by low-priority traffic. The design supports priority-aware allocation, service tiers with differentiated guarantees, and debt-based fairness mechanisms, all without modifying the underlying inference runtime or cluster scheduler. In experiments on a Kubernetes cluster with vLLM backends, token pools maintain a bounded P99 latency for guaranteed workloads during overload by selectively throttling spot traffic, while a baseline without admission control experiences unbounded latency degradation across all workloads. A second experiment demonstrates debt-based fair-share convergence among elastic workloads with heterogeneous SLO requirements during capacity scarcity.
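
The admission behavior described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Entitlement`, `TokenPool`, `admit`, `release`, and `demo` are hypothetical, and capacity is modeled along a single dimension (concurrent request slots) rather than the full set of inference-native units. The assumed rule is that an entitlement is always admitted up to its guaranteed slots, while traffic beyond a guarantee (including spot traffic with no guarantee) is admitted only when the pool has a free slot, and is otherwise rejected with HTTP 429.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HTTP_OK = 200
HTTP_TOO_MANY_REQUESTS = 429


@dataclass
class Entitlement:
    """Hypothetical entitlement: a named claim on concurrency slots."""
    name: str
    guaranteed_slots: int  # 0 models spot (best-effort) traffic
    in_flight: int = 0


@dataclass
class TokenPool:
    """Hypothetical pool that tracks one capacity dimension: concurrency."""
    capacity: int
    entitlements: Dict[str, Entitlement] = field(default_factory=dict)

    def add(self, ent: Entitlement) -> None:
        # Assumption: guarantees may never be promised beyond capacity.
        reserved = sum(e.guaranteed_slots for e in self.entitlements.values())
        if reserved + ent.guaranteed_slots > self.capacity:
            raise ValueError("guarantees would exceed pool capacity")
        self.entitlements[ent.name] = ent

    def total_in_flight(self) -> int:
        return sum(e.in_flight for e in self.entitlements.values())

    def admit(self, name: str) -> int:
        """Return an HTTP-style status for one incoming request."""
        ent = self.entitlements[name]
        within_guarantee = ent.in_flight < ent.guaranteed_slots
        has_free_slot = self.total_in_flight() < self.capacity
        # Work-conserving backfill: any entitlement may use an idle slot.
        # Backfilled requests are not preempted in this sketch, so usage can
        # transiently exceed capacity when a guaranteed request arrives later.
        if within_guarantee or has_free_slot:
            ent.in_flight += 1
            return HTTP_OK
        return HTTP_TOO_MANY_REQUESTS

    def release(self, name: str) -> None:
        """Mark one in-flight request of this entitlement as finished."""
        ent = self.entitlements[name]
        if ent.in_flight == 0:
            raise ValueError(f"no in-flight requests for {name}")
        ent.in_flight -= 1


def demo() -> List[int]:
    """Fill a 4-slot pool, then send one spot request too many."""
    pool = TokenPool(capacity=4)
    pool.add(Entitlement("guaranteed-a", guaranteed_slots=3))
    pool.add(Entitlement("spot", guaranteed_slots=0))
    statuses = [pool.admit("guaranteed-a") for _ in range(3)]
    statuses.append(pool.admit("spot"))  # backfills the one idle slot
    statuses.append(pool.admit("spot"))  # pool is full: rejected
    return statuses
```

In this sketch the final spot request is rejected rather than queued, which is the property the abstract attributes to token pools: excess low-priority load is shed at admission instead of deepening the backend queue. A fuller model would also charge each request against token throughput and KV cache, which this example leaves out.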

Paper Structure (23 sections, 3 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: System architecture. The TokenPool controller aggregates demand from entitlements and manages backend capacity while the Virtual Node Provider projects pool capacity into Kubernetes extended resources. AI Workloads access token-based compute resources through the AI Gateway using an inference key mapped to a particular entitlement. As demand fluctuates, the Dynamo planner reacts and scales GPU workers in order to meet service level objectives across tenants.
  • Figure 2: Request queue depth during overload. (a) With token pools, running requests remain at capacity while the waiting queue stays empty; excess spot requests receive HTTP 429 responses. (b) Without admission control, the request queue grows to 34 requests, leading to sustained latency degradation.
  • Figure 3: End-to-end request latency. Token pools maintain bounded latency by rejecting excess spot requests; the baseline experiences unbounded latency growth as the queue deepens.
  • Figure 4: Pool slot utilization by entitlement. Guaranteed workloads maintain their allocations while spot is squeezed during overload. Spot recovers immediately when guaranteed-c departs.
  • Figure 5: Priority dynamics during capacity scarcity and recovery. Four panels: (1) in-flight requests with pool capacity; (2) service debt and priority weight; (3) admission and denial rates; (4) effective admission delay. Copilot (tight SLO) receives preferential admission while synth (loose SLO) absorbs throttling. Debt accumulates during underservice and decays during recovery.
  • ...and 1 more figures
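
The debt dynamics summarized in the Figure 5 caption (debt accumulates during underservice and decays during recovery) can be sketched as follows. The paper's actual equations are not reproduced on this page, so the update rule, the `decay` factor, the `slo_weight` field, and the priority formula below are all assumptions chosen for illustration. The workload names copilot and synth come from the caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical elastic workload with an SLO-derived base weight."""
    name: str
    slo_weight: float  # assumption: larger for tighter SLOs
    debt: float = 0.0  # service owed to this workload

    def priority(self) -> float:
        # Assumed formula: debt scales the SLO weight upward.
        return self.slo_weight * (1.0 + self.debt)


def update_debt(w: Workload, entitled: float, served: float,
                decay: float = 0.5) -> None:
    """Assumed rule: add the shortfall when underserved, else decay."""
    shortfall = entitled - served
    if shortfall > 0:
        w.debt += shortfall  # underservice accumulates debt
    else:
        w.debt *= decay      # recovery shrinks debt geometrically


def pick_next(workloads: List[Workload]) -> Workload:
    """Admit from the workload with the highest current priority."""
    return max(workloads, key=lambda w: w.priority())
```

Under these assumptions, a loose-SLO workload that absorbs throttling gradually gains priority until it is served again, which is one simple way to obtain the kind of fair-share convergence the caption describes. For example, with `copilot` at `slo_weight=2.0` and `synth` at `slo_weight=1.0`, copilot is preferred while both have zero debt. Once synth has been underserved by three units, its priority rises above copilot's.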