FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs

Xuan He, Zequan Fang, Jinzhao Lian, Danny H. K. Tsang, Baosen Zhang, Yize Chen

TL;DR

FREESH tackles the carbon- and energy-efficient serving of LLMs across heterogeneous GPUs by introducing a cross-layer framework combining pool-level routing, MIAD dynamic frequency scaling, and LLF request scheduling. It leverages spatiotemporal carbon intensity and traffic forecasts to allocate GPUs across locations, partition requests by type, and adapt GPU frequencies to meet SLOs with minimal energy and emissions. The approach yields substantial reductions in energy (28.6%) and emissions (45.45%) while improving SLO attainment and fairness, validated across production-like workloads and diverse datasets. By coordinating routing, scheduling, and DVFS across distributed data centers, FREESH demonstrates practical, open-source strategies for carbon-aware LLM serving at scale.

Abstract

The ever-increasing computation and energy demands of LLMs and AI agents call for holistic and efficient optimization of LLM serving systems. In practice, heterogeneous GPU clusters can be deployed in a geographically distributed manner, while LLM load also varies in both query traffic and serving patterns. LLM queries running on advanced GPUs during a high-emission hour at one location can have a significantly higher carbon footprint than the same queries running on mid-level GPUs at a low-emission time and location. By observing LLM serving requirements and leveraging spatiotemporal computation flexibility, we consider the joint routing and scheduling problem, and propose FREESH to cooperatively run a group of data centers while minimizing user-specified carbon or energy objectives. FREESH identifies the optimal configurations for balanced load serving by matching the power-throughput characteristics of distinct GPU instances with predictable LLM query lengths and workloads. To meet both latency and fairness requirements, FREESH combines optimized parallelism and query routing schedules with dynamic GPU frequency scaling for power saving and a Least-Laxity-First (LLF) serving strategy for query scheduling. During 1-hour serving on production workloads, FREESH reduces energy by 28.6% and emissions by 45.45%, together with improvements in SLO attainment and fairness.

Paper Structure

This paper contains 55 sections, 1 theorem, 13 equations, 21 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

For the utility maximization problem (equ:utility_max), with concave, continuously differentiable $U_i(x_i)$ and convex, continuously differentiable $P(f)$, an optimality condition holds for the optimal GPU frequency $f^*$. Implementing the associated primal-dual update rule with step sizes $k_x, k_f, k_r$ converges to the global optimum $f^*$.

Figures (21)

  • Figure 1: Spatiotemporal routing of LLM load for carbon and energy reduction. The locational marginal emission time series of the two data centers differ, creating opportunities to coordinate GPU clusters in response to incoming workload characteristics.
  • Figure 2: Resource utilization improved by request partition.
  • Figure 3: Power-latency optimal trade-off reached at the marked point. Two GPU models are profiled as an illustration.
  • Figure 4: Toy example with three requests $R_0$, $R_1$, and $R_2$ arriving sequentially at $t = 0$, $1$, and $2$ and requiring 10, 2, and 1 tokens, respectively, with a throughput of 1 token per second and a scheduling policy update every second. In FCFS, $R_0$ occupies the worker entirely, causing head-of-line blocking for $R_1$ and $R_2$. In contrast, LLF dynamically prioritizes requests with smaller laxity values, allowing more urgent requests to proceed earlier. As a result, LLF achieves lower average latency, lower TTFT, and better fairness across concurrent requests than FCFS.
  • Figure 5: Workflow of the proposed FREESH for LLM serving. A three-level approach enables joint optimization of emission, energy, and SLO objectives. First, an integer programming (IP) model dynamically assigns GPUs from different locations to virtual pools every 30 minutes, with each pool tailored to a specific request category based on time-varying emission rates, performance profiles, and traffic forecasts. Incoming requests are then classified, routed to a specific queue, and scheduled for serving by an LLF algorithm that prioritizes fairness. During response generation, an MIAD algorithm dynamically adjusts the GPU's clock frequency every second to balance the trade-off between energy consumption and SLO performance.
  • ...and 16 more figures
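
The toy example in Figure 4 can be reproduced with a short simulation. The sketch below is illustrative only and is not the authors' implementation: the arrival times, token counts, 1 token/s throughput, and per-second policy update come from the caption, while the deadline rule (arrival plus `SLACK_FACTOR` seconds per token), the tie-breaking order, and all names (`Request`, `simulate`, `make_requests`, `mean_latency`) are assumptions introduced here. Laxity is taken in its standard form: deadline minus current time minus remaining service time.

```python
from dataclasses import dataclass

SLACK_FACTOR = 2  # assumed: each request is allowed 2 s per token (not from the source)


@dataclass(frozen=True)
class Request:
    name: str
    arrival: int   # arrival time in seconds
    tokens: int    # tokens to generate
    deadline: int  # absolute deadline in seconds (assumed rule, see make_requests)


def make_requests():
    """Requests from the Figure 4 caption, with hypothetical deadlines attached."""
    spec = [("R0", 0, 10), ("R1", 1, 2), ("R2", 2, 1)]
    return [Request(n, a, k, a + SLACK_FACTOR * k) for n, a, k in spec]


def simulate(requests, policy):
    """Serve requests on one worker at 1 token/s, re-deciding every second.

    Returns {name: (first_token_time, completion_time)}.
    """
    remaining = {r.name: r.tokens for r in requests}
    first_token, done = {}, {}
    t = 0
    while len(done) < len(requests):
        ready = [r for r in requests if r.arrival <= t and r.name not in done]
        if not ready:
            t += 1
            continue
        if policy == "fcfs":
            # Earliest arrival first: the oldest request runs until it finishes.
            chosen = min(ready, key=lambda r: r.arrival)
        elif policy == "llf":
            # Laxity = deadline - now - remaining service time (1 s per token).
            # Ties are broken by arrival time (an assumption).
            chosen = min(
                ready,
                key=lambda r: (r.deadline - t - remaining[r.name], r.arrival),
            )
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining[chosen.name] -= 1
        t += 1  # the token is emitted at the end of this one-second slot
        first_token.setdefault(chosen.name, t)
        if remaining[chosen.name] == 0:
            done[chosen.name] = t
    return {r.name: (first_token[r.name], done[r.name]) for r in requests}


def mean_latency(requests, result):
    """Average end-to-end latency (completion time minus arrival time)."""
    return sum(result[r.name][1] - r.arrival for r in requests) / len(requests)


if __name__ == "__main__":
    reqs = make_requests()
    for policy in ("fcfs", "llf"):
        res = simulate(reqs, policy)
        print(policy, res, round(mean_latency(reqs, res), 2))
```

Under these assumed deadlines, FCFS completes $R_0$, $R_1$, and $R_2$ at 10 s, 12 s, and 13 s, while LLF completes them at 13 s, 4 s, and 3 s, so mean latency drops from about 10.7 s to about 5.7 s and every request receives its first token within one second of arrival. The numbers depend on the chosen deadline rule: with a single fixed slack for every request, the long request $R_0$ would always have the least laxity and LLF would reproduce the FCFS order here.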

Theorems & Definitions (2)

  • Proposition 1
  • proof