BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

Omar Basit, Yunzhao Liu, Z. Jonny Kong, Y. Charlie Hu

TL;DR

BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models, enabling coordinated control across timescales while preserving strict serving SLOs.

Abstract

Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control. We present BiScale, a two-tier energy optimization framework for disaggregated LLM serving. BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, BiScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, BiScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs. Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production-style traces shows that BiScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.
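
As a rough illustration of what slack-aware frequency adaptation for decode could look like, the sketch below picks, for the next decode iteration, the candidate frequency with the lowest predicted energy among those whose predicted latency still fits the remaining TPOT budget. This is a minimal sketch under stated assumptions, not the authors' algorithm: the `FreqPoint` type, the `pick_decode_frequency` function, the running-average definition of slack, and all numeric values are hypothetical, and the per-frequency latency and power predictions are assumed to come from models like those the abstract mentions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FreqPoint:
    """One candidate GPU frequency with hypothetical model predictions."""
    mhz: int
    predicted_iter_ms: float   # assumed output of a latency model
    predicted_power_w: float   # assumed output of a power model


def pick_decode_frequency(
    points: Sequence[FreqPoint],
    tpot_slo_ms: float,
    tokens_done: int,
    elapsed_ms: float,
) -> FreqPoint:
    """Choose a frequency for the next decode iteration.

    The budget is the time the next iteration may take while keeping the
    running average time-per-output-token at or below the SLO.
    """
    budget_ms = tpot_slo_ms * (tokens_done + 1) - elapsed_ms
    feasible = [p for p in points if p.predicted_iter_ms <= budget_ms]
    if not feasible:
        # No slack left: fall back to the highest frequency available.
        return max(points, key=lambda p: p.mhz)
    # Energy per iteration is power times time (W * ms = mJ).
    return min(feasible, key=lambda p: p.predicted_iter_ms * p.predicted_power_w)


# Made-up operating points, for illustration only.
candidates = [
    FreqPoint(mhz=1980, predicted_iter_ms=20.0, predicted_power_w=600.0),
    FreqPoint(mhz=1410, predicted_iter_ms=24.0, predicted_power_w=420.0),
    FreqPoint(mhz=990, predicted_iter_ms=32.0, predicted_power_w=330.0),
]

choice = pick_decode_frequency(candidates, tpot_slo_ms=30.0, tokens_done=10, elapsed_ms=280.0)
print(choice.mhz)
```

Note that with these invented numbers the lowest frequency is not the lowest-energy choice, because its longer iteration time outweighs its lower power draw; a selection rule based on predicted energy rather than frequency alone captures that tradeoff.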

Paper Structure

This paper contains 43 sections, 2 equations, 14 figures, 2 tables, and 1 algorithm.

Figures (14)

  • Figure 1: The RPS timelines of the Azure LLM inference trace azure-public-dastaset over 10 hours, 10 minutes, and 1 minute.
  • Figure 2: Variance-time plot of request-per-second (RPS) in the Azure LLM inference trace azure-public-dastaset. The trace exhibits notable fluctuation across both short and long timescales, with slightly greater variance observed at shorter timescales.
  • Figure 3: Number of running requests in prefill and decode instances plotted with workload in RPS.
  • Figure 4: Architecture overview of BiScale.
  • Figure 5: Results for various controlled workloads with constant average RPS.
  • ...and 9 more figures