GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

TL;DR

A systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and the authors' lightweight ML-based estimator -- evaluates accuracy, generalizability, and practical overhead, reveals key tradeoffs, and validates the estimators against real-world unseen models.

Abstract

Collocating deep learning training tasks improves GPU utilization but causes drastic slowdowns due to resource contention and risks Out-of-Memory (OOM) failures. Accurate memory estimation is essential for robust collocation, while GPU utilization -- a key proxy for resource contention -- enables interference-aware scheduling to reduce slowdowns and improve throughput. Existing GPU memory estimators span three paradigms -- analytical models, CPU-side libraries, and ML-based estimators -- each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit markedly different memory footprints across hardware generations. GPU utilization estimation remains comparatively understudied and is further complicated by the non-additive nature of utilization metrics and by hardware sensitivity. We conduct a systematic analysis of representative estimators from each paradigm -- Horus, PyTorch FakeTensor, and our lightweight ML-based estimator -- evaluating accuracy, generalizability, and practical overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, and train MLP- and Transformer-based estimators for memory prediction. We further experiment with utilization estimation on the same dataset. Our evaluation reveals key tradeoffs and validates the estimators against real-world unseen models. Significant challenges remain: analytical models are hardware-dependent, CPU-side libraries impose intrusive integration costs, and ML-based estimators struggle with cross-architecture generalization. We release all datasets, tools, and artifacts to support further research.
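To make the "analytical model" paradigm concrete, the sketch below estimates the training memory of a plain MLP from its specification alone. It is an illustrative back-of-the-envelope formula, not the one used by Horus or by the authors: the names `MLPSpec`, `mlp_param_count`, and `estimate_training_bytes` are hypothetical, and it assumes fp32 tensors, an Adam-style optimizer with two state tensors per parameter, and one stored activation per layer boundary. It deliberately ignores the CUDA context, allocator caching, and library workspaces, which is one reason such estimates can diverge from measured usage and vary across hardware.

```python
from dataclasses import dataclass
from typing import List

BYTES_PER_FP32 = 4  # assumption: all tensors are stored in fp32


@dataclass
class MLPSpec:
    """Hypothetical model specification an analytical estimator would need."""
    input_dim: int
    hidden_dims: List[int]
    output_dim: int
    batch_size: int


def mlp_param_count(spec: MLPSpec) -> int:
    """Count weights and biases of the fully connected layers."""
    dims = [spec.input_dim, *spec.hidden_dims, spec.output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


def estimate_training_bytes(spec: MLPSpec, optimizer_states_per_param: int = 2) -> int:
    """Rough lower-bound estimate of training memory in bytes.

    Counts weights, gradients, optimizer state, and one activation tensor per
    layer boundary. Framework and driver overheads are NOT included.
    """
    params = mlp_param_count(spec)
    # weights + gradients + optimizer states (2 per parameter for Adam-style)
    static_bytes = params * (1 + 1 + optimizer_states_per_param) * BYTES_PER_FP32
    # activations kept for the backward pass: batch_size x width per boundary
    dims = [spec.input_dim, *spec.hidden_dims, spec.output_dim]
    activation_bytes = spec.batch_size * sum(dims) * BYTES_PER_FP32
    return static_bytes + activation_bytes


if __name__ == "__main__":
    spec = MLPSpec(input_dim=4, hidden_dims=[3], output_dim=2, batch_size=8)
    print(mlp_param_count(spec), estimate_training_bytes(spec))
```

The formula depends entirely on a detailed model specification, which is the limitation the abstract attributes to analytical models. Comparing its output against measured peak memory on a given GPU would show how much the omitted overheads matter in practice.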

Paper Structure (21 sections, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Actual GPU memory need vs. Horus' estimations for MLP models with varying numbers of neurons and layers.
  • Figure 2: YAML configuration for MLP generation.
  • Figure 3: Staircase growth pattern of memory usage for MLPs on ImageNet ILSVRC15 with batch_size=32.
  • Figure 4: Principal Component Analysis (PCA) of the dataset across different neural network architectures. The figure shows how discretizing the continuous GPU memory usage facilitates formulating the problem as a classification task.
  • Figure 5: GPU memory estimation for real-world unseen CNN and Transformer models using Horus, FakeTensor, and GPUMemNet. FakeTensor fails on Transformer models, and GPUMemNet cannot produce an estimate for some unseen models, e.g., DLRM (denoted with X). GPUMemNet provides the closest estimations to actual GPU memory consumption and almost never underestimates.
  • ...and 2 more figures