Semantic-Aware Scheduling for GPU Clusters with Large Language Models

Zerui Wang, Qinghao Hu, Ana Klimovic, Tianwei Zhang, Yonggang Wen, Peng Sun, Dahua Lin

TL;DR

The paper addresses the semantic gap in DL GPU cluster scheduling by using LLMs to exploit unstructured data (source code, runtime logs, historical jobs). It proposes SchedMate, a plug-in framework that hooks into existing schedulers and comprises three modules: Scheduling Advisor for semantic workload metadata, Metric Tracker for progress observability, and Failure Handler for automated failure recovery. In physical-cluster experiments and extensive simulations, SchedMate reduces average job completion time (JCT) by up to 1.91x by enabling retrieval-based workload prediction, real-time progress monitoring, and rapid failure diagnosis and remediation. This semantic-aware approach reduces profiling overhead, improves observability, and enhances resilience to failures, offering practical gains for modern DL clusters.

Abstract

Deep learning (DL) schedulers are pivotal in optimizing resource allocation in GPU clusters, but operate with a critical limitation: they are largely blind to the semantic context of the jobs they manage. This forces them to rely on limited metadata, leading to high profiling overhead, unreliable duration estimation, inadequate failure handling, and poor observability. To this end, we propose SchedMate, a framework that bridges this semantic gap by systematically extracting deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. SchedMate enhances existing schedulers non-intrusively through three LLM-based components. Our implementation integrates seamlessly with existing deep learning schedulers. Evaluations on a 128-GPU physical cluster and extensive simulations on production traces show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance, demonstrating the critical role of semantic-awareness in modern DL scheduling.

Paper Structure

This paper contains 20 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Existing DL schedulers operate on limited metadata (top), while SchedMate enriches the scheduling process with deep insights from source code, logs, and historical jobs (bottom).
  • Figure 1: Physical Evaluation: Comparison of policy performance between physical and simulated environments.
  • Figure 2: Background: DL workload characteristics across the Microsoft Philly, SenseTime Helios, and Acme clusters. (a) CDF of the job duration. (b, c) Final statuses of jobs in terms of quantity and utilized GPU resources.
  • Figure 3: Examples of semantic information available in source code and runtime logs that are opaque to traditional schedulers.
  • Figure 4: System Overview of SchedMate. The system integrates with four data sources (prior jobs, source code, runtime log, and hardware metrics). The modules attached to robot symbols utilize LLMs.
  • ...and 11 more figures