Cascadia: An Efficient Cascade Serving System for Large Language Models

Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D. Lane, Binhang Yuan

TL;DR

Cascadia tackles the challenge of efficiently serving large language models (LLMs) under latency and quality constraints by deploying model cascades that trade off speed and accuracy. It introduces a bi-level optimization framework: a MILP-based deployment solver determines resource allocations and parallelism strategies across the cascade stages, while a Chebyshev-guided routing solver optimizes request routing to minimize latency $L$ while meeting a minimum quality $Q$, with the two objectives normalized by $z_1^{\star}$ and $z_2^{\star}$. The approach is instantiated with DeepSeek and Llama cascades and evaluated against stand-alone LLMs and CascadeServe, achieving up to $4\times$ tighter latency SLOs ($2.3\times$ on average) and up to $5\times$ higher throughput ($2.4\times$ on average), and it stays robust under fluctuating workloads through a re-scheduling mechanism. These results indicate that co-designing deployment and routing for cascaded LLM serving enables faster, more cost-efficient responses while maintaining target answer quality.

Abstract

Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are faster yet less capable. Recent work proposes balancing this latency-quality trade-off using model cascades, which route simpler queries to smaller models and more complex ones to larger models. However, enabling efficient cascade serving remains challenging. Current frameworks lack effective mechanisms for handling (i) the huge and varying resource demands of different LLMs, (ii) the inherent heterogeneity of LLM workloads, and (iii) the co-optimization of system deployment and routing strategy. Motivated by these observations, we introduce Cascadia, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving. Cascadia employs a bi-level optimization method: at the deployment level, it uses a mixed-integer linear program to select resource allocations and parallelism strategies based on LLM information and workload characteristics; at the routing level, it applies a Chebyshev-guided method to iteratively co-optimize the routing strategy and the system deployment produced by the deployment level. Our extensive evaluation on diverse workload traces and different model cascades (DeepSeek and the Llama series) demonstrates that Cascadia significantly outperforms both single-model deployments and the state-of-the-art cascade serving baseline, achieving up to 4$\times$ (2.3$\times$ on average) tighter latency SLOs and up to 5$\times$ (2.4$\times$ on average) higher throughput while maintaining target answer quality.
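To make the routing-level idea concrete, the following is a minimal sketch of a weighted Chebyshev scalarization over latency and quality. It is not Cascadia's actual solver: the `Candidate` type, the `pick_candidate` helper, the equal default weights, and the choice of reference points (best latency and best quality among feasible candidates) are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical routing candidate with an estimated latency and quality."""
    name: str
    latency: float  # L, lower is better, assumed positive
    quality: float  # Q, higher is better, assumed positive


def chebyshev_score(c: Candidate, z1_star: float, z2_star: float,
                    w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted Chebyshev distance to the reference point (z1_star, z2_star).

    Each gap is normalized by its reference value so that latency and
    quality are compared on the same scale (an assumed normalization).
    """
    latency_gap = (c.latency - z1_star) / z1_star
    quality_gap = (z2_star - c.quality) / z2_star
    return max(w1 * latency_gap, w2 * quality_gap)


def pick_candidate(candidates: List[Candidate], min_quality: float,
                   w1: float = 0.5, w2: float = 0.5) -> Optional[Candidate]:
    """Return the feasible candidate with the smallest Chebyshev score."""
    # Enforce the minimum quality requirement first.
    feasible = [c for c in candidates if c.quality >= min_quality]
    if not feasible:
        return None
    # Reference point: best achievable value of each objective on its own.
    z1_star = min(c.latency for c in feasible)
    z2_star = max(c.quality for c in feasible)
    return min(feasible,
               key=lambda c: chebyshev_score(c, z1_star, z2_star, w1, w2))
```

Minimizing the largest normalized gap, rather than a weighted sum, penalizes a candidate that is good on one objective but far from the reference on the other, which is the usual motivation for Chebyshev-style scalarization.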

Paper Structure

This paper contains 18 sections, 8 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Average response quality and latencies of different DeepSeek models. Quality is judged by GPT-4o using the LLM-as-a-Judge framework (Zheng et al., 2023).
  • Figure 1: Bi-level Scheduling Workflow
  • Figure 2: Benchmarked performance of different parallelism strategies across different workloads and model sizes. Short and long outputs denote two workloads with average output sequence lengths of 512 and 1024, respectively; the three-element array gives the DP, TP, and PP degrees.
  • Figure 3: Illustration of a model deployment plan.
  • Figure 4: Threshold-based cascade routing workflow. The router determines whether a request is accepted or forwarded to the next model type based on predefined thresholds.
  • ...and 7 more figures
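
The threshold-based cascade routing described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch only: the `Stage` callable, the confidence score it returns, and the `route` helper are hypothetical names, and the source does not state how a response is scored against its threshold.

```python
from typing import Callable, List, Tuple

# A stage maps a request to (answer, score). How the score is produced
# (for example, a judge model or a confidence estimate) is an assumption.
Stage = Callable[[str], Tuple[str, float]]


def route(request: str, stages: List[Stage],
          thresholds: List[float]) -> Tuple[int, str]:
    """Run a request through the cascade, smallest model first.

    Returns the index of the stage that accepted the request and its answer.
    A stage accepts when its score meets its threshold; otherwise the
    request is forwarded. The final stage always accepts.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    if len(thresholds) != len(stages) - 1:
        raise ValueError("need one threshold per non-final stage")

    for index, stage in enumerate(stages[:-1]):
        answer, score = stage(request)
        if score >= thresholds[index]:
            return index, answer  # accepted at this stage

    # No earlier stage accepted, so the last (largest) model answers.
    answer, _ = stages[-1](request)
    return len(stages) - 1, answer
```

For example, with a toy two-stage cascade in which the small model is only confident on short requests, `route("hi", [small, large], [0.5])` is answered by stage 0, while a long request falls through to stage 1.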