Table of Contents
Fetching ...

CARMA: Collocation-Aware Resource Manager

Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün

TL;DR

CARMA tackles GPU underutilization in deep learning training by enabling task-level collocation with careful OOM and interference control. It combines fine-grained GPU telemetry, memory-need estimators, risk-based placement policies, and a lightweight OOM-recovery mechanism to improve utilization while maintaining QoS. The approach yields significant gains in SM utilization, memory efficiency, and throughput, with meaningful reductions in makespan and energy consumption on production-like traces. Its server-scale focus and recovery-first design make collocation more robust for real-world DL workloads, and the framework lays groundwork for broader adoption and future enhancements in multi-server settings and inference workloads.

Abstract

GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a $\sim$35% and $\sim$15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.

CARMA: Collocation-Aware Resource Manager

TL;DR

CARMA tackles GPU underutilization in deep learning training by enabling task-level collocation with careful OOM and interference control. It combines fine-grained GPU telemetry, memory-need estimators, risk-based placement policies, and a lightweight OOM-recovery mechanism to improve utilization while maintaining QoS. The approach yields significant gains in SM utilization, memory efficiency, and throughput, with meaningful reductions in makespan and energy consumption on production-like traces. Its server-scale focus and recovery-first design make collocation more robust for real-world DL workloads, and the framework lays groundwork for broader adoption and future enhancements in multi-server settings and inference workloads.

Abstract

GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system for the server-scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out the high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a 35% and 15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.

Paper Structure

This paper contains 29 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Actual GPU memory need vs Horus' estimations for MLP models with varying number of neurons and layers.
  • Figure 2: GPU memory estimation for real-world unseen CNN and Transformer models using Horus, FakeTensor, and GPUMemNet. FakeTensor fails at Transformer models and GPUMemNet cannot estimate for the unseen model, e.g., DLRM (denoted with X). GPUMemNet provides the closest estimations to actual GPU memory consumption and almost never underestimates.
  • Figure 3: Overview of CARMA.
  • Figure 4: Trace total time (makespan) for the first trace with a variety of collocation policies.
  • Figure 5: p95 JCT, waiting time, and execution time (minutes) for the first trace across collocation policies.
  • ...and 8 more figures