Design and Operation of Shared Machine Learning Clusters on Campus
Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, Kai Chen
TL;DR
This paper tackles the challenge of efficiently managing shared GPU clusters for academic ML research. It introduces SING, a four-layer, full-stack campus cluster manager that decouples job profiling, adaptation, scheduling, and execution, built atop Slurm with ML-specific optimizations. SING achieves high resource utilization through an optimized first-come-first-served (FCFS) scheduler with backfill, pre-initialized environments, and two-dimensional resource granularity, while offering an MLaaS-style user experience and interactive debugging. The authors provide extensive operational data, discuss design choices and limitations, and release open-source resources, including code, configurations, and a rich job trace, to facilitate the deployment of similar campus clusters and guide future work on shared ML infrastructure.
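To make the "FCFS with backfill" idea concrete, the following is a minimal Python sketch of a single scheduling step using the textbook EASY-backfill rule. It is not SING's optimized variant, nor Slurm's implementation; the names `Job` and `schedule_step`, the GPU-only resource model, and the reliance on user-supplied runtime estimates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int         # GPUs requested (hypothetical single-resource model)
    est_runtime: int  # user-supplied runtime estimate, in minutes


def schedule_step(now, total_gpus, running, queue):
    """Return the jobs to start at time `now`.

    running: list of (end_time, gpus) for jobs already executing.
    queue:   list of Job in arrival order.
    Assumes every queued job requests at most `total_gpus`.
    """
    free = total_gpus - sum(g for _, g in running)
    running = list(running)
    pending = list(queue)
    started = []

    # FCFS phase: start jobs from the head of the queue while they fit.
    while pending and pending[0].gpus <= free:
        job = pending.pop(0)
        free -= job.gpus
        running.append((now + job.est_runtime, job.gpus))
        started.append(job)
    if not pending:
        return started

    # Reservation for the blocked head job: find the earliest time
    # ("shadow time") at which enough GPUs will have been released.
    head = pending[0]
    avail, shadow_time = free, None
    for end, g in sorted(running):
        avail += g
        if avail >= head.gpus:
            shadow_time = end
            break
    if shadow_time is None:  # head can never fit; skip backfilling
        return started
    extra = avail - head.gpus  # GPUs the head will not need at shadow time

    # Backfill phase: a later job may jump ahead only if it cannot
    # delay the head job's reserved start.
    for job in pending[1:]:
        if job.gpus > free:
            continue
        if now + job.est_runtime <= shadow_time:
            free -= job.gpus          # finishes before the head starts
            started.append(job)
        elif job.gpus <= extra:
            free -= job.gpus          # only uses GPUs the head leaves idle
            extra -= job.gpus
            started.append(job)
    return started


# Usage: 8-GPU cluster, 6 GPUs busy until t=100, so 2 GPUs are free.
queue = [Job("A", 4, 60), Job("B", 2, 50), Job("C", 2, 200)]
picked = schedule_step(now=0, total_gpus=8, running=[(100, 6)], queue=queue)
print([j.name for j in picked])  # ['B']
```

Under strict FCFS nothing would start here, because job A blocks the queue. With backfill, job B runs on the two idle GPUs and is expected to finish before A's reserved start at t=100, so A is not delayed.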
Abstract
Amid the rapid advancements in large machine learning (ML) models, universities worldwide are investing substantial funds and effort into GPU clusters. However, managing a shared GPU cluster poses a pyramid of challenges, from hardware configuration to resource allocation among users. This paper introduces SING, a full-stack solution designed to streamline the management of shared GPU clusters in academic institutions. Motivated by the pressing need for efficient resource sharing and the challenges posed by limited staffing, we present a comprehensive view of SING's architecture and design choices, which together achieve operational efficiency (i.e., low maintenance cost and high resource utilization). We also share experience and insights from the real-world operation of SING, including an analysis of its usage patterns and its management of incidents and failures. This paper is part of our ongoing effort to improve the management of shared ML clusters. We open-source relevant resources to facilitate the development and operation of similar clusters for ML.
