GPU Memory Prediction for Multimodal Model Training

Jinwoo Jeong, Minchul Kang, Younghun Go, Changyong Shin, Hyunho Lee, Junho Yoon, Gyeongsik Yang, Chuck Yoo

TL;DR

This work addresses out-of-memory (OoM) errors during the training of large multimodal models by introducing a per-layer memory predictor that factorizes memory consumption into four components: parameters, optimizer states, gradients, and activations. The framework parses the model into modality-specific modules, decomposes each module into layers, applies a predictor to each factor, and aggregates the results to estimate peak memory. On the LLaVA-1.5 7B model, it achieves about 8.7% average MAPE (mean absolute percentage error) across diverse hyperparameters, and about 13% in a broader setting. The approach enables proactive memory provisioning for multimodal training and could be extended to cover optimization techniques and inference workloads, further reducing OoM risk.
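The parse, decompose, factorize, and aggregate steps can be illustrated with a minimal Python sketch. It is not the authors' predictor: the `Layer` fields, the module names, the byte sizes, and the plain summation are all assumptions chosen for illustration, and a real estimate of peak memory would also have to account for allocator behavior and the timing of allocations.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical layer description; all fields are illustrative assumptions."""
    name: str
    num_params: int        # number of parameters in the layer
    trainable: bool        # assumed: frozen layers hold no gradients or optimizer states
    activation_elems: int  # activation elements kept for backward, per sample


def layer_memory_bytes(layer, batch_size, bytes_per_elem=2, optimizer_states_per_param=2):
    """Split one layer's memory into the four factors named in the text.

    Assumptions: 2-byte parameters, gradients, and activations; an optimizer
    that keeps two 4-byte states per trainable parameter.
    """
    params = layer.num_params * bytes_per_elem
    grads = layer.num_params * bytes_per_elem if layer.trainable else 0
    opt = layer.num_params * optimizer_states_per_param * 4 if layer.trainable else 0
    acts = layer.activation_elems * batch_size * bytes_per_elem
    return {
        "parameters": params,
        "gradients": grads,
        "optimizer_states": opt,
        "activations": acts,
    }


def predict_memory_bytes(modules, batch_size):
    """Aggregate per-layer, per-factor estimates over all modality modules."""
    total = 0
    for layers in modules.values():
        for layer in layers:
            total += sum(layer_memory_bytes(layer, batch_size).values())
    return total


# Toy model with made-up sizes; module names are placeholders.
toy_modules = {
    "vision_module": [Layer("v0", num_params=10, trainable=False, activation_elems=5)],
    "language_module": [Layer("l0", num_params=10, trainable=True, activation_elems=5)],
}
estimate = predict_memory_bytes(toy_modules, batch_size=2)
```

Keeping the four factors separate per layer makes it easy to see which factor dominates and how the estimate changes when a module is frozen or the batch size grows.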

Abstract

As deep learning models in agentic AI systems grow in scale and complexity, their GPU memory requirements increase and often exceed the available GPU memory capacity, causing out-of-memory (OoM) errors. An OoM error interrupts the entire training run and wastes substantial computational resources, so accurate prediction of GPU memory usage is essential for preventing it. However, previous studies focus only on unimodal architectures and fail to generalize to multimodal models, even though multimodal models are a common choice in agentic AI systems. To address this limitation, we propose a framework that predicts peak GPU memory usage by analyzing the architecture and training behavior of multimodal models. Specifically, the framework decomposes the multimodal model into its constituent layers and applies factorization to estimate the memory usage of each layer. Our evaluation shows that the framework predicts memory usage accurately, with an average MAPE of about 8.7%.
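MAPE (mean absolute percentage error) is an error metric, so lower values mean more accurate predictions. The sketch below shows the standard definition in Python; the function name and the sample values are illustrative assumptions and are not measurements from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    `actual` and `predicted` are equal-length sequences of measured and
    predicted peak memory values; every actual value must be non-zero.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up measured vs. predicted peak memory values (arbitrary units).
measured = [100.0, 200.0]
estimated = [90.0, 220.0]
error_percent = mape(measured, estimated)
```

Under this definition, an average MAPE of about 8.7% means the predicted peak memory deviates from the measured peak by about 8.7% on average.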

Paper Structure

This paper contains 6 sections, 1 equation, 2 figures.

Figures (2)

  • Figure 1: Workflow of the proposed framework.
  • Figure 2: GPU memory usage prediction results.