FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, Kejiang Ye
TL;DR
FlexPipe targets efficient LLM serving in production, where request patterns are highly dynamic and serverless clusters are fragmented. It introduces inflight pipeline refactoring built on three techniques: fine-grained model partitioning, inflight topology reconfiguration with cache-consistent transitions, and topology-aware resource allocation. Evaluated on an 82-GPU cluster, it achieves up to 8.5× better resource efficiency and 38.3% lower latency than state-of-the-art systems, cuts GPU reservation from 75% to 30% of peak capacity, and scales better during bursts. The results suggest that dynamic, fine-grained adaptability, rather than static optimization, substantially improves both efficiency and reliability in practical serverless LLM serving.
Abstract
Serving Large Language Models (LLMs) in production faces significant challenges from highly variable request patterns and severe resource fragmentation in serverless clusters. Current systems rely on static pipeline configurations that struggle to adapt to dynamic workload conditions, leading to substantial inefficiencies. We present FlexPipe, a novel system that dynamically reconfigures pipeline architectures at runtime to address these limitations. FlexPipe decomposes models into fine-grained stages and adjusts pipeline granularity based on real-time request pattern analysis, implementing three key innovations: fine-grained model partitioning that preserves computational graph constraints, inflight pipeline refactoring with consistent cache transitions, and topology-aware resource allocation that navigates GPU fragmentation. Comprehensive evaluation on an 82-GPU cluster demonstrates that FlexPipe achieves up to 8.5× better resource efficiency and 38.3% lower latency than state-of-the-art systems, while reducing GPU reservation requirements from 75% to 30% of peak capacity.
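To make the idea of adjustable pipeline granularity concrete, the following is a minimal sketch and not FlexPipe's implementation. It assumes a model is a linear sequence of layers and that a stage is a contiguous run of layer indices; the function names `partition_layers` and `refactor` are hypothetical. It shows only that a pipeline can be re-split into more or fewer stages while preserving layer order, and it leaves out cache transitions, GPU placement, and the policy that picks the stage count.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if i < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def refactor(stages: list[range], new_num_stages: int) -> list[range]:
    """Re-split an existing pipeline into a different number of stages.

    Layer order is preserved; only the stage boundaries move.
    """
    num_layers = sum(len(stage) for stage in stages)
    return partition_layers(num_layers, new_num_stages)


if __name__ == "__main__":
    coarse = partition_layers(32, 4)   # 4 stages of 8 layers each
    fine = refactor(coarse, 8)         # 8 stages of 4 layers each
    print([len(stage) for stage in coarse])
    print([len(stage) for stage in fine])
```

In this sketch the caller chooses the target stage count; in a real system that choice would come from request pattern analysis and the available GPU topology.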
