Cascadia: An Efficient Cascade Serving System for Large Language Models

Youhe Jiang; Fangcheng Fu; Wanru Zhao; Stephan Rabanser; Jintao Zhang; Nicholas D. Lane; Binhang Yuan

TL;DR

Cascadia addresses the problem of serving large language models (LLMs) efficiently under latency and quality constraints by deploying model cascades that trade speed against accuracy. It introduces a bi-level optimization framework: a MILP-based deployment solver determines resource allocations and parallelism strategies across the cascade stages, while a Chebyshev-guided routing solver optimizes request routing to minimize latency $L$ subject to a minimum quality $Q$, with the two objectives normalized by $z_1^{\star}$ and $z_2^{\star}$. The approach is instantiated with DeepSeek and Llama cascades and evaluated against stand-alone LLMs and CascadeServe, achieving up to $4\times$ ($2.3\times$ on average) tighter latency SLOs and up to $5\times$ ($2.4\times$ on average) higher throughput, and it remains robust under fluctuating workloads through a re-scheduling mechanism. These results show the practical value of co-designing deployment and routing for cascaded LLM serving, enabling faster, quality-preserving responses.

Abstract

Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are faster yet less capable. Recent work proposes balancing this latency-quality trade-off using model cascades, which route simpler queries to smaller models and more complex ones to larger models. However, enabling efficient cascade serving remains challenging. Current frameworks lack effective mechanisms for handling (i) the huge and varying resource demands of different LLMs, (ii) the inherent heterogeneity of LLM workloads, and (iii) the co-optimization of system deployment and routing strategy. Motivated by these observations, we introduce Cascadia, a novel cascade serving framework designed explicitly to schedule request routing and deploy model cascades for fast, quality-preserving LLM serving. Cascadia employs a bi-level optimization method: at the deployment level, it uses a mixed-integer linear program to select resource allocations and parallelism strategies based on LLM information and workload characteristics; at the routing level, it applies a Chebyshev-guided method to iteratively co-optimize the routing strategy and the system deployment produced by the deployment level. Our extensive evaluation on diverse workload traces and different model cascades (DeepSeek and the Llama series) demonstrates that Cascadia significantly outperforms both single-model deployments and the state-of-the-art cascade serving baseline, achieving up to 4$\times$ (2.3$\times$ on average) tighter latency SLOs and up to 5$\times$ (2.4$\times$ on average) higher throughput while maintaining target answer quality.
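To make the routing-level idea concrete, here is a minimal sketch of a weighted Chebyshev scalarization over a handful of candidate routing configurations. It is not the paper's solver: the `Candidate` fields, the single escalation threshold, the weights `w1` and `w2`, the normalization by the best observed latency `z1` and quality `z2`, and the toy numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical routing configuration and its estimated outcomes."""
    threshold: float  # assumed escalation threshold (higher = more traffic to the large model)
    latency: float    # estimated latency L, in seconds
    quality: float    # estimated answer quality Q, in [0, 1]


def chebyshev_score(c: Candidate, z1: float, z2: float,
                    w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted Chebyshev distance to the reference point (z1, z2).

    z1 is the best (lowest) latency and z2 the best (highest) quality;
    each gap is normalized by its reference value (an assumed normalization).
    """
    latency_gap = (c.latency - z1) / z1
    quality_gap = (z2 - c.quality) / z2
    return max(w1 * latency_gap, w2 * quality_gap)


def pick_routing(cands: Sequence[Candidate], q_min: float,
                 w1: float = 0.5, w2: float = 0.5) -> Optional[Candidate]:
    """Return the feasible candidate with the smallest Chebyshev score."""
    feasible = [c for c in cands if c.quality >= q_min]
    if not feasible:
        return None  # no configuration meets the quality requirement
    z1 = min(c.latency for c in cands)
    z2 = max(c.quality for c in cands)
    return min(feasible, key=lambda c: chebyshev_score(c, z1, z2, w1, w2))


# Toy values for illustration only; they are not results from the paper.
candidates = [
    Candidate(threshold=0.2, latency=1.0, quality=0.70),
    Candidate(threshold=0.5, latency=2.0, quality=0.85),
    Candidate(threshold=0.8, latency=4.0, quality=0.90),
]
best = pick_routing(candidates, q_min=0.80)
```

In this sketch the quality floor `q_min` acts as a hard filter, and the Chebyshev score then selects among the remaining configurations. Shifting weight toward `w2` favors higher-quality, slower configurations. In a full system, each candidate's latency would come from the deployment-level solver rather than from fixed numbers.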
