Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman

TL;DR

Vortex tackles the challenge of hosting ML inference and knowledge retrieval as services with strict latency targets by adopting an SLO-first design for componentized pipelines. It integrates a DLL-based architecture atop a key-value store, enables data locality through affinity grouping, and leverages zero-copy data paths and RDMA when available to minimize tail latency while maintaining high throughput. The approach employs opportunistic batching, anticipatory model preloading, and elastic, shard-aligned microservice deployments to sustain SLOs across varying load, outperforming TorchServe and, in latency-sensitive regimes, Ray Serve (with RDMA providing additional gains). Empirical results on two representative pipelines show substantial throughput improvements and tighter latency distributions, suggesting that end-to-end ML services can meet application-driven latency guarantees without sacrificing throughput or cost efficiency. This work offers practical, deployable strategies for real-world AI agents and interactive ML-powered applications that require both responsiveness and high request rates.

Abstract

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications and deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.

Paper Structure

This paper contains 26 sections, 16 figures.

Figures (16)

  • Figure 1: Two Representative ML-as-a-Service Pipelines
  • Figure 2: Vortex System Architecture
  • Figure 3: Stage to Stage Handoffs
  • Figure 4: Resource requirements of PreFLMR components. We vary the batch size and show throughput and GPU memory usage.
  • Figure 5: Resource packing comparison for two PreFLMR deployment options. With monolithic deployments (top), PreFLMR performs 8 queries in the time period shown. The microservice option (bottom) enables scheduling flexibility: by running visual embedding (B) on three nodes and the remaining components (A, C, D) on the fourth, throughput rises to 15 queries in the same time period.
  • ...and 11 more figures