PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
Eunyeong Cho, Jehyeon Bang, Ranggi Hwang, Minsoo Rhu
TL;DR
PASCAL addresses the latency characteristics of reasoning-based LLMs with phase-aware scheduling that splits the decoding phase into a reasoning phase and an answering phase. The system uses a hierarchical, two-level scheduler: an instance-level component that assigns requests to GPU instances and can migrate them at phase boundaries, and an intra-instance component that serves reasoning requests with high-priority round-robin (RR) scheduling and answering requests with low-priority RR plus a token pacer. Empirical results on DeepSeek-R1-Distill-Qwen-32B show reductions in tail Time-To-First-Token (TTFT) of up to 72% and robust SLO attainment for the answering phase, while maintaining throughput similar to baselines; KV cache transfer overhead remains negligible in the face of long reasoning phases. These findings indicate that phase-aware, memory-conscious scheduling is critical for deploying reasoning-based LLMs at scale, enabling faster perceived responses without sacrificing quality or throughput.
Abstract
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
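The intra-instance policy summarized above (reasoning requests first, answering requests only when no reasoning work is pending, round-robin within each class) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Request` and `TwoLevelRoundRobin` classes, the `quantum` parameter, and the single execution slot are all hypothetical, and the token pacer, batching, KV cache management, and instance-level migration are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Request:
    """Hypothetical request with token budgets left in each decoding phase."""
    rid: str
    reasoning_tokens: int  # tokens still to generate in the reasoning phase
    answer_tokens: int     # tokens still to generate in the answering phase

    @property
    def phase(self) -> str:
        return "reasoning" if self.reasoning_tokens > 0 else "answering"

    @property
    def done(self) -> bool:
        return self.reasoning_tokens == 0 and self.answer_tokens == 0


class TwoLevelRoundRobin:
    """Assumed model: one execution slot, two FIFO queues, strict priority."""

    def __init__(self, quantum: int = 2) -> None:
        self.quantum = quantum  # max tokens generated per scheduling turn
        self.reasoning: Deque[Request] = deque()  # high priority
        self.answering: Deque[Request] = deque()  # low priority

    def submit(self, req: Request) -> None:
        # Route the request to the queue that matches its current phase.
        target = self.reasoning if req.phase == "reasoning" else self.answering
        target.append(req)

    def step(self) -> Optional[str]:
        """Run one turn; return the id of the request served, or None if idle."""
        # Answering requests run only when no reasoning request is waiting.
        queue = self.reasoning if self.reasoning else self.answering
        if not queue:
            return None
        req = queue.popleft()
        # A turn never crosses a phase boundary, so a request that finishes
        # reasoning is re-queued at low priority before it starts answering.
        if req.phase == "reasoning":
            req.reasoning_tokens -= min(self.quantum, req.reasoning_tokens)
        else:
            req.answer_tokens -= min(self.quantum, req.answer_tokens)
        if not req.done:
            self.submit(req)  # back of the queue for its (possibly new) phase
        return req.rid


def run(requests: List[Request], quantum: int = 2) -> List[str]:
    """Submit all requests up front and return the order in which they ran."""
    sched = TwoLevelRoundRobin(quantum)
    for req in requests:
        sched.submit(req)
    order: List[str] = []
    while (rid := sched.step()) is not None:
        order.append(rid)
    return order


if __name__ == "__main__":
    print(run([Request("A", 3, 2), Request("B", 1, 2)]))
```

In this toy model, both requests complete their reasoning turns before either produces answer tokens, which mirrors the goal of shortening the wait before the first user-visible token. A real serving system would schedule batches of requests per iteration and pace answer tokens to meet the answering-phase SLO, neither of which this sketch attempts.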
