Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Reza Abbasi, Sernam Lim

TL;DR

Superpipeline is a framework that optimizes the execution of large AI models on constrained hardware during both training and inference. It enables the use of larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications.

Abstract

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs), vision-language models (VLMs), and vision-based models. We tested Superpipeline's performance across various models and hardware setups. The method includes two key parameters that allow fine-tuning the balance between GPU memory use and processing speed. Importantly, Superpipeline does not require retraining or changing model parameters, ensuring that the original model's output remains unchanged. Superpipeline's simplicity and flexibility make it useful for researchers and professionals working with advanced AI models on limited hardware. It enables the use of larger models or bigger batch sizes on existing hardware, potentially speeding up innovation across many machine learning applications. This work marks an important step toward making advanced AI models more accessible and optimizing their deployment in resource-limited environments. The code for Superpipeline is available at https://github.com/abbasiReza/super-pipeline.

Paper Structure

This paper contains 24 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Superpipeline Diagram. Comparison of model execution strategies: Standard (all layers on GPU), Naive ($k=2$), and Superpipeline ($k=4, k'=2$). $k$ is the number of layers resident on the GPU at the same time. $k'$ is the number of layers transferred back to the CPU after computation and, at the same time, the number of upcoming layers moved to the GPU. Superpipeline optimizes GPU memory usage through this dynamic layer management.
  • Figure 2: Comparison of Memory Usage and Speed during ViT-BigG Training on ImageNet-Tiny, with a Batch Size of 16 for All Scenarios.
  • Figure 3: Comparison of layer transfer times between GPU and CPU for ViT-bigG and Llama2 models.
  • Figure 4: Comparison of Sequential and Batch Transfer Strategies
  • Figure 5: Performance comparison of Sequential vs. Batch Transfer strategies
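
The roles of $k$ and $k'$ described in the Figure 1 caption can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the function name `simulate_superpipeline` and the event tuples are hypothetical, layers are plain integers rather than real modules, no tensors are moved, and the schedule is strictly sequential (a real system would presumably overlap transfers with computation). It assumes that after every $k'$ computed layers, those layers are offloaded and the same number of upcoming layers is loaded.

```python
def simulate_superpipeline(num_layers, k, k_prime):
    """Simulate a sliding window of layers on the GPU (illustrative only).

    k        -- assumed maximum number of layers resident on the GPU
    k_prime  -- assumed number of layers swapped out (and in) per step
    Returns (events, peak_resident).
    """
    if k < 1 or not (1 <= k_prime <= k):
        raise ValueError("require k >= 1 and 1 <= k_prime <= k")

    events = []
    resident = []          # layer indices currently "on the GPU"
    next_to_load = 0

    # Initial fill: place the first k layers on the GPU.
    while next_to_load < min(k, num_layers):
        resident.append(next_to_load)
        events.append(("load", next_to_load))
        next_to_load += 1

    peak = len(resident)
    done = []              # computed layers waiting to be offloaded

    for layer in range(num_layers):
        if layer not in resident:
            raise RuntimeError(f"layer {layer} is not on the GPU")
        events.append(("compute", layer))
        done.append(layer)

        # Swap once k_prime layers are finished, or at the end of the model.
        if len(done) == k_prime or layer == num_layers - 1:
            for finished in done:
                resident.remove(finished)
                events.append(("offload", finished))
            for _ in range(len(done)):
                if next_to_load < num_layers:
                    resident.append(next_to_load)
                    events.append(("load", next_to_load))
                    next_to_load += 1
            done = []
        peak = max(peak, len(resident))

    return events, peak


if __name__ == "__main__":
    # The Superpipeline setting from Figure 1, on a toy 8-layer model.
    events, peak = simulate_superpipeline(num_layers=8, k=4, k_prime=2)
    print("peak layers on GPU:", peak)
    for event in events:
        print(event)
```

In this toy model the peak number of resident layers never exceeds $k$, while a smaller $k'$ means more frequent, smaller swaps. How this translates into actual memory savings and speed depends on layer sizes and transfer costs, which the sketch does not model.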