BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

Omar Basit; Yunzhao Liu; Z. Jonny Kong; Y. Charlie Hu

TL;DR

BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models, enabling coordinated control across timescales while preserving strict serving SLOs.

Abstract

Prefill/decode disaggregation is increasingly adopted in LLM serving to improve the latency-throughput tradeoff and meet strict TTFT and TPOT SLOs. However, LLM inference remains energy-hungry: autoscaling alone is too coarse-grained to track fast workload fluctuations, and applying fine-grained DVFS under disaggregation is complicated by phase-asymmetric dynamics and coupling between provisioning and frequency control. We present BiScale, a two-tier energy optimization framework for disaggregated LLM serving. BiScale jointly optimizes placement and DVFS across prefill and decode using predictive latency and power models. At coarse timescales, BiScale computes phase-aware placement and baseline frequencies that minimize energy while satisfying SLO constraints. At fine timescales, BiScale dynamically adapts GPU frequency per iteration using stage-specific control: model predictive control (MPC) for prefill to account for queue evolution and future TTFT impact, and lightweight slack-aware adaptation for decode to exploit its smoother, memory-bound dynamics. This hierarchical design enables coordinated control across timescales while preserving strict serving SLOs. Evaluation on a 16x H100 cluster serving Llama 3.3 70B with production-style traces shows that BiScale meets TTFT/TPOT SLOs while reducing energy by up to 39% in prefill and 48% in decode relative to DistServe.
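To make the idea of choosing a frequency that saves energy while satisfying an SLO more concrete, here is a minimal sketch. It is not BiScale's controller. It assumes a hypothetical latency predictor (`toy_decode_latency_ms`), a hypothetical list of supported clocks (`FREQS_MHZ`), and the simplification that the lowest SLO-safe clock is also the lowest-energy one. The abstract describes predictive latency and power models; the toy function below stands in for those and its constants are made up for illustration.

```python
from typing import Callable, Sequence


def pick_lowest_safe_frequency(
    freqs_mhz: Sequence[int],
    predict_latency_ms: Callable[[int], float],
    slo_ms: float,
) -> int:
    """Return the lowest clock whose predicted latency meets the SLO.

    Falls back to the highest available clock when no setting meets it.
    """
    for freq in sorted(freqs_mhz):
        if predict_latency_ms(freq) <= slo_ms:
            return freq
    return max(freqs_mhz)


# Hypothetical latency model (an assumption, not from the paper):
# a fixed memory-bound term plus a compute term that shrinks with clock speed.
def toy_decode_latency_ms(freq_mhz: int) -> float:
    return 20.0 + 30000.0 / freq_mhz


# Hypothetical clock settings, chosen only for illustration.
FREQS_MHZ = [990, 1200, 1410, 1620, 1980]

chosen = pick_lowest_safe_frequency(FREQS_MHZ, toy_decode_latency_ms, slo_ms=46.0)
print(chosen)
```

With a looser SLO the function drops to a lower clock, and with an unattainable SLO it returns the maximum clock. A real controller would also need a power model, since the lowest feasible clock does not always minimize energy once execution time is taken into account.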

Paper Structure (43 sections, 2 equations, 14 figures, 2 tables, 1 algorithm)

Introduction
Background
LLM Online Serving Workload Dynamics
Energy Efficient LLM Serving
Prefill/Decode Disaggregation
The Energy Efficient P/D Disaggregation Serving Problem
Challenges
Key Observation: Phase-specific Workload Characteristics
Key Challenges
Design
Design Principles
System Architecture
Tier 1: Coarse-Grained Provisioning
The Cluster Provisioning Problem
Placement Optimization via ILP
...and 28 more sections

Figures (14)

Figure 1: The RPS timelines of the Azure LLM inference trace [azure-public-dastaset] over 10 hours, 10 minutes, and 1 minute.
Figure 2: Variance-time plot of requests per second (RPS) in the Azure LLM inference trace [azure-public-dastaset]. The trace exhibits notable fluctuation across both short and long timescales, with slightly greater variance observed at shorter timescales.
Figure 3: Number of running requests in prefill and decode instances plotted with workload in RPS.
Figure 4: Architecture overview of BiScale.
Figure 5: Results for various controlled workloads with constant average RPS.
...and 9 more figures
