PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

Eunyeong Cho; Jehyeon Bang; Ranggi Hwang; Minsoo Rhu

TL;DR

PASCAL addresses the latency characteristics specific to reasoning-based LLMs with phase-aware scheduling that splits decoding into a reasoning phase and an answering phase. The system uses a hierarchical, two-level scheduler: an instance-level component that assigns requests to GPU instances and can migrate them at phase boundaries, and an intra-instance component that schedules reasoning requests with high-priority round-robin (RR) and answering requests with low-priority RR plus a token pacer. Empirical results on DeepSeek-R1-Distill-Qwen-32B show reductions in tail TTFT of up to 72% and robust SLO attainment for the answering phase, with throughput similar to the baselines; KV cache transfer overhead remains negligible relative to the long reasoning phases. These findings indicate that phase-aware, memory-conscious scheduling is critical for deploying reasoning-based LLMs at scale, enabling faster perceived responses without sacrificing quality or throughput.
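The two-priority round-robin idea for the intra-instance component can be illustrated with a minimal sketch. This is not the paper's implementation: the class `TwoLevelRR`, its methods, and the `capacity` parameter (standing in for the GPU memory limit on batch size) are hypothetical names chosen for illustration, and the token pacer and instance-level migration are omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    phase: str  # "reasoning" or "answering"


class TwoLevelRR:
    """Hypothetical two-priority round-robin queue (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # assumed max requests batched per quantum
        self.reasoning = deque()   # high-priority queue
        self.answering = deque()   # low-priority queue

    def add(self, req: Request) -> None:
        queue = self.reasoning if req.phase == "reasoning" else self.answering
        queue.append(req)

    def finish_reasoning(self, req: Request) -> None:
        # Phase boundary: the request drops to the low-priority queue.
        self.reasoning.remove(req)
        req.phase = "answering"
        self.answering.append(req)

    def next_batch(self) -> list:
        # Fill the batch from the reasoning queue first, then from answering.
        batch = []
        for queue in (self.reasoning, self.answering):
            take = min(self.capacity - len(batch), len(queue))
            picked = [queue.popleft() for _ in range(take)]
            batch.extend(picked)
            queue.extend(picked)  # rotate: picked requests go to the back
        return batch


sched = TwoLevelRR(capacity=2)
a, b, c = (Request(r, "reasoning") for r in "ABC")
d = Request("D", "answering")
for req in (a, b, c, d):
    sched.add(req)

first = [r.rid for r in sched.next_batch()]
second = [r.rid for r in sched.next_batch()]
sched.finish_reasoning(a)
sched.finish_reasoning(c)
third = [r.rid for r in sched.next_batch()]
```

In this toy run, answering request D is only scheduled once the reasoning queue no longer fills the batch, which is the starvation risk that a token pacer on the answering side is meant to absorb.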

Abstract

The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.

Paper Structure (22 sections, 16 figures, 2 algorithms)

Introduction
Background
LLM Serving Metrics and Execution Phases
Need for Blocking & Preempting Requests in LLM Serving
LLM Scheduling Policy
Rethinking TTFT and TPOT in Reasoning-Based LLMs
Characterization and Motivation
Methodology
Characterization
Motivation
Pascal: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based LLMs
High-level Overview
Instance-level Scheduler
Intra-instance Scheduler
Evaluation
...and 7 more sections

Figures (16)

Figure 1: Comparison of how TTFT and TPOT are defined in conventional LLMs versus reasoning-based LLMs. TTFT is the latency between the query submission time and the time the first "user-visible" response token ($t_1$) is received by the user. (a) In conventional LLMs, this first user-visible token is generated at the end of the prefill stage, so TTFT equals the prefill-stage latency. (b) In contrast, for reasoning-based LLMs, TTFT is the sum of the prefill-stage latency, the latency of the reasoning phase, and the latency from the end of the reasoning phase to the generation of the first answering token.
Figure 2: Timeline illustrating how the serving system handles requests under three different scenarios. We assume that Request A, B, and C arrive at times 0, 1, and 2, respectively, and that the GPU memory capacity allows at most two requests to be batched simultaneously. (a) With oracular execution under infinite GPU memory, there is no limit on how many requests can be batched, so each request begins execution immediately upon arrival, and no preemption occurs. (b) Under FCFS scheduling, the first two requests (A and B) are batched, while C must wait. Request C joins the batch only after A finishes decoding, yielding a TTFT of 7 time units. (c) In round‑robin (RR) scheduling, each request decodes a fixed quantum of four tokens before being preempted. Request C waits for 2 time units, joins the batch when Request A is preempted at time 4, and is itself preempted after producing four tokens at time 8.
Figure 3: An example token serving scenario with QoE measurement, where token delivery begins at the target TTFT without additional delay. (i) The serving system initially generates tokens (red line) faster than the user’s expected reading pace (black dotted line), causing tokens to be buffered by the token pacer. The user consumes tokens at their own pace, represented by the user digested line (yellow). (ii) When the serving system is temporarily paused, the user continues consuming the buffered tokens. (iii) Once all the tokens kept in the buffer are depleted, the user experiences starvation until token generation resumes. (iv) Once token generation resumes, the user continues consuming tokens at a steady pace, although the user's token digestion timeline remains behind the expected schedule due to the earlier delay experienced when the serving system paused. QoE is measured by computing the ratio between the area under the user digested token curve (yellow) and the area under the user expected curve (black dotted line).
Figure 4: (Left axis) Breakdown of average reasoning phase latency across oracle, FCFS, and RR scheduling policies for varying numbers of reasoning tokens (x-axis), normalized to the oracle latency at each reasoning token count. Executed (black) denotes time the request actively ran on the GPU without being blocked (red) or preempted (yellow). (Right axis) Corresponding average absolute latency in seconds, indicated by green triangles.
Figure 5: (a) Answering phase latency breakdown across varying numbers of answering tokens and (b) the corresponding SLO attainment rate. The SLO target for the answering phase is determined by two factors: (1) the latency between the end of the reasoning phase and the generation of the first answering token (henceforth referred to as Time-To-First-Answering-Token (TTFAT)), and (2) whether the answering token generation rate meets the TPOT target. As such, per our definition of QoE in Figure 3, the answering phase's QoE is determined by the target values of TTFAT and TPOT. Following prior work (DistServe), we set the target TTFAT and TPOT values to 0.25 seconds and 100 milliseconds, respectively. A request violates the SLO if its QoE score falls below 0.95.
...and 11 more figures
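
The QoE ratio described in the Figure 3 caption and the SLO rule in the Figure 5 caption can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's measurement code: the function names are hypothetical, token times are measured from the end of the reasoning phase, the user is assumed to digest at most one token per TPOT interval starting at the target TTFAT, and the integration horizon (one TPOT interval past the later of the last expected and last digested token) is an assumption made here so that both areas are non-zero.

```python
def digest_times(gen_times, ttfat, tpot):
    """Times at which the user digests each token (hypothetical model)."""
    times = []
    prev = None
    for g in gen_times:
        # Delivery starts at the target TTFAT, then at most one token per TPOT.
        earliest = ttfat if prev is None else prev + tpot
        prev = max(g, earliest)  # a token cannot be digested before it exists
        times.append(prev)
    return times


def qoe(gen_times, ttfat=0.25, tpot=0.1):
    """Ratio of the area under the digested curve to the expected curve."""
    if not gen_times:
        raise ValueError("need at least one generated token")
    expected = [ttfat + i * tpot for i in range(len(gen_times))]
    digested = digest_times(gen_times, ttfat, tpot)
    # Assumed horizon: one TPOT past the later of the two final tokens.
    horizon = max(expected[-1], digested[-1]) + tpot

    def area(times):
        # Each token adds one unit of height from its time until the horizon.
        return sum(horizon - t for t in times)

    return area(digested) / area(expected)


def violates_slo(gen_times, threshold=0.95):
    return qoe(gen_times) < threshold


# Tokens generated faster than the reading pace: buffered, QoE stays at ~1.
smooth = [0.05 * i for i in range(10)]
# Five tokens, then a long pause before the remaining five: user starves.
stalled = [0.05 * i for i in range(5)] + [5.0 + 0.05 * i for i in range(5)]

smooth_qoe = qoe(smooth)
stalled_qoe = qoe(stalled)
```

With the default targets taken from the Figure 5 caption (TTFAT of 0.25 seconds, TPOT of 100 milliseconds), the smooth trace meets the SLO while the stalled trace falls below the 0.95 threshold.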
