Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows

Yuting Yang, Andrea Merlina, Weijia Song, Tiancheng Yuan, Ken Birman, Roman Vitenberg

TL;DR

Compass targets latency-sensitive, DAG-structured ML workflows on edge clusters by co-designing decentralized scheduling and GPU memory caching. It combines two-phase planning with dynamic adjustment, taking into account data locality, GPU cache contents, and inter-task dependencies, and it runs on a fully decentralized, Derecho-backed state-sharing fabric. Empirical results show 2x–6x reductions in end-to-end latency with equal or fewer resources than competing schedulers; in one case, half the servers sufficed for the same workload. The approach achieves high cache hit rates and efficient use of GPU memory, and its performance holds across bursty and production-like traces.

Abstract

We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries, a computing style often seen in interactive applications for image processing and natural language processing. In such systems, co-scheduling GPU memory management and task placement is a promising opportunity. We propose Compass, a novel framework that unifies these functions to reduce job latency while using resources efficiently: it places tasks where their data dependencies will be satisfied, collocates tasks from the same job (when this will not overload the host or its GPU), and manages GPU memory efficiently. Comparison with other state-of-the-art schedulers shows a significant reduction in completion times while requiring the same amount of resources or even fewer. In one case, just half the servers were needed to process the same workload.
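The placement criteria named in the abstract can be illustrated with a toy cost model. This is a minimal sketch and not the Compass algorithm: the `Worker` and `Task` types, the `estimated_finish` and `place` functions, and the constants `MODEL_LOAD_SECONDS` and `TRANSFER_SECONDS` are all hypothetical names and values chosen for illustration. It shows only how GPU cache contents, data locality, and existing load might be folded into a single estimate when choosing a worker.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical constants, chosen only to make the example concrete.
MODEL_LOAD_SECONDS = 2.0   # assumed cost of loading a model into GPU memory
TRANSFER_SECONDS = 0.5     # assumed cost of moving an upstream output between workers


@dataclass
class Worker:
    name: str
    gpu_cached_models: Set[str] = field(default_factory=set)
    queued_seconds: float = 0.0  # work already waiting on this worker


@dataclass
class Task:
    name: str
    model: str
    run_seconds: float
    input_from: Optional[str] = None  # worker that holds the upstream output, if any


def estimated_finish(task: Task, worker: Worker) -> float:
    """Estimate when the task would finish if placed on this worker."""
    cost = worker.queued_seconds + task.run_seconds
    if task.model not in worker.gpu_cached_models:
        cost += MODEL_LOAD_SECONDS   # GPU cache miss: the model must be loaded
    if task.input_from is not None and task.input_from != worker.name:
        cost += TRANSFER_SECONDS     # upstream output lives on another worker
    return cost


def place(task: Task, workers: List[Worker]) -> Worker:
    """Pick the worker with the lowest estimated finish time."""
    return min(workers, key=lambda w: estimated_finish(task, w))


workers = [Worker("a", {"detector"}), Worker("b")]
task = Task("detect", model="detector", run_seconds=1.0, input_from="b")
chosen = place(task, workers)  # worker "a": the cache hit outweighs the transfer
```

Under these assumed costs, a cached model outweighs a remote input, but a long queue on the caching worker tips the choice the other way, which mirrors the abstract's caveat about not overloading a host or its GPU.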

Paper Structure

This paper contains 38 sections, 5 equations, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: Pipelines
  • Figure 2: Worker Components in Compass
  • Figure 3: Example of Job Instance Handling
  • Figure 4: Network Transfer between Nodes
  • Figure 5: Compass Shared State Table (SST)
  • ...and 5 more figures