SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

Bohan Zhao, Zane Cao, Yongchao He

TL;DR

The paper identifies sampling as a structural bottleneck in distributed LLM inference: it does not parallelize with tensor parallelism (TP) and sits in the last pipeline-parallel (PP) stage, so it limits throughput as TP and PP scale. It proposes SIMPLE, a stage-agnostic, sequence-parallel decision-plane service that offloads sampling to CPUs using three core mechanisms: sequence-parallel sampling, column-wise CPU penalties with truncation-first filtering, and speculative hot-vocab sampling (SHVS) with rejection-correctness. In evaluation, SIMPLE improves end-to-end throughput by up to 96% and reduces P95 latency by 20-65% across multiple models and GPU generations, with modest CPU overhead and no user-side code changes. Because it decouples sampling from the GPU data plane and composes with existing TP/PP deployments, its benefits are expected to compound as GPUs get faster.

Abstract

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns logits into tokens, becomes a new bottleneck. This creates a structural holdout: sampling neither expands with TP nor balances across PP stages, so its share of iteration time grows as GPUs get faster and it caps pipeline frequency at the last stage. We present SIMPLE, a stage-agnostic, sequence-parallel, overlappable decision plane that disaggregates sampling into a CPU-side service and shrinks its runtime footprint back to a minor, hidden role. SIMPLE combines: (1) sequence-parallel sampling, which shards work along the batch dimension and removes vocabulary-axis collectives; (2) a CPU-based algorithm with column-wise penalties and truncation-first filtering to realize single-pass, linear-time kernels; and (3) speculative hot-vocab sampling (SHVS), which samples on a small hot set with rejection-correctness and uses a simple sizing model to choose the hot-vocab size that maximizes throughput. In evaluation, SIMPLE improves end-to-end throughput by up to 96% and reduces P95 latency by 20-65%. Crucially, SIMPLE requires no user-side code changes and composes with existing data-plane optimizations, unlocking scaling benefits that compound with future GPU generations.
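The rejection step behind mechanism (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the SHVS algorithm, so the code below uses the standard accept/reject construction from speculative sampling, and it assumes the full normalized distribution is already available (a real system would have to avoid that cost). The names `softmax`, `sample_index`, and `hot_vocab_sample` are hypothetical. The draft distribution is the target restricted to the hot set and renormalized; a draft token is accepted with probability equal to the hot set's total mass, and on rejection the token is drawn from the remaining (cold) vocabulary, which keeps the output distribution equal to the target.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def hot_vocab_sample(probs, hot_ids, rng):
    """Sample a token id from `probs` by drafting on a hot subset first.

    Assumption (not from the paper): `probs` is the full normalized
    distribution and `hot_ids` is a list of distinct token ids.
    """
    hot_mass = sum(probs[i] for i in hot_ids)
    draft = None
    if hot_mass > 0.0:
        # Draft from q(x) = p(x) / hot_mass, defined on the hot set only.
        draft = hot_ids[sample_index([probs[i] for i in hot_ids], rng)]
        # Acceptance probability p(x) / q(x) equals hot_mass for every hot x.
        if rng.random() < hot_mass:
            return draft
    # Rejected: the residual max(0, p - q) is p restricted to cold tokens.
    hot_set = set(hot_ids)
    cold_ids = [i for i in range(len(probs)) if i not in hot_set]
    cold_weights = [probs[i] for i in cold_ids]
    if sum(cold_weights) <= 0.0:
        # Hot set covers all the mass up to rounding; keep the draft.
        return draft
    return cold_ids[sample_index(cold_weights, rng)]
```

In this sketch the expensive full-vocabulary scan runs only on rejection, which happens with probability one minus the hot-set mass. That is one way to see why the choice of hot-vocab size matters: a larger hot set is rejected less often but costs more per draft.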

Paper Structure

This paper contains 25 sections, 12 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Sampling bottlenecks in inference on 8$\times$H100. Bars denote iteration time; filled regions denote computation time.
  • Figure 2: Architecture and workflow of SIMPLE.
  • Figure 3: End-to-end throughput (tokens/s) across platforms and models.
  • Figure 4: TPOT ECDF on L40 (P95 marked).
  • Figure 5: TPOT ECDF on H100 (P95 marked).
  • ...and 8 more figures