Efficient allocation of image recognition and LLM tasks on multi-GPU system

Marcin Lawenda, Krzesimir Samborski, Kyrylo Khloponin, Łukasz Szustak

TL;DR

The paper addresses the challenge of efficiently allocating image recognition and LLM tasks on multi-GPU systems. It uses profiling-driven analysis with TorchTune and Nsight Systems to compare data-parallel and distributed strategies, examining precision and memory optimizations across MobileNet v2 training and Llama3-8B LoRA tuning. Key findings show that distributed data parallelism with distributed sampling (DDP-DS) scales well for image workloads, and that reduced precision (FP32/FP16) yields substantial speedups while preserving accuracy. The pin_memory option offers moderate gains that diminish as more GPUs are added. LLM tuning sees smaller iteration-time gains from additional GPUs, and its run time is dominated by kernel launches and synchronization rather than memory transfers. The results provide practical guidance for designing scalable multi-GPU ML pipelines and show where performance gains are most effectively realized in both vision and language workloads.

Abstract

This work evaluates the performance of parallelized learning and tuning processes for image classification and large language models. For a machine learning model in image recognition, several parallelization methods are developed for different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. The presented strategies are described in detail, highlighting the challenges and benefits of applying them. The impact of different dataset types on the tuning process of large language models is also investigated. Experiments show to what extent the task type affects the iteration time in a multi-GPU environment, offering insights into data utilization strategies that improve model performance. The study leverages the built-in parallelization mechanisms of PyTorch that facilitate these tasks, and incorporates performance profiling to evaluate the impact of memory and communication operations during the training/tuning procedure. Test scenarios are developed and evaluated with numerous benchmarks on the NVIDIA H100 architecture, with efficiency reported through selected metrics.
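
To make the idea of distributed sampling concrete, the sketch below shows in plain Python how a dataset can be split so that each GPU process (rank) sees a disjoint slice of the samples. It is not the authors' code. It assumes a strided, padded split without shuffling, which mirrors the general scheme used by PyTorch's `DistributedSampler`; the function name `shard_indices` and the sizes used are hypothetical.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices one rank would process (no shuffling).

    Illustrative assumption: a strided split with wrap-around padding so
    that every rank receives the same number of samples.
    """
    if dataset_len < world_size:
        raise ValueError("sketch assumes at least one sample per rank")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size

    indices = list(range(dataset_len))
    # Pad by reusing the first few samples so the total divides evenly.
    indices += indices[: total - dataset_len]

    # Rank r takes every world_size-th index, starting at offset r.
    return indices[rank:total:world_size]


# Hypothetical example: 10 samples split across 4 GPU processes.
shards = [shard_indices(10, 4, r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Because every rank processes the same number of samples, the processes stay in step at each gradient synchronization point; the cost is that a few samples are seen twice per epoch when the dataset size is not divisible by the number of GPUs.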

Paper Structure

This paper contains 40 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Distributed learning algorithm performance scaling, FP32, $pin\_memory = false$
  • Figure 2: Execution time (left) and efficiency (right) graphs for image sizes from 100x100 to 500x500, FP64, $pin\_memory = false$
  • Figure 3: Diagram of 4 timing metrics from Nsight Systems, in percent
  • Figure 4: Dataset tuning performance scaling
  • Figure 5: CUDA API call times for the tested datasets
  • ...and 1 more figure
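
Figure 2 reports efficiency alongside execution time. As a minimal sketch of how such a metric is commonly derived, the snippet below computes speedup and parallel efficiency from per-configuration run times. It assumes the standard definitions (speedup = T1 / TN, efficiency = speedup / N); the paper's exact formula is not given in this summary, and the timings used here are hypothetical placeholders, not measured results.

```python
def speedup_and_efficiency(times_by_gpus):
    """Map GPU count -> (speedup, efficiency) relative to the 1-GPU run.

    Assumes the standard definitions:
        speedup(N)    = T(1) / T(N)
        efficiency(N) = speedup(N) / N
    """
    if 1 not in times_by_gpus:
        raise ValueError("a single-GPU baseline time is required")
    base = times_by_gpus[1]
    return {
        n: (base / t, base / (t * n))
        for n, t in sorted(times_by_gpus.items())
    }


# Hypothetical timings in seconds (NOT taken from the paper).
timings = {1: 100.0, 2: 50.0, 4: 40.0}
results = speedup_and_efficiency(timings)

for n, (s, e) in results.items():
    print(f"{n} GPU(s): speedup {s:.2f}, efficiency {e:.1%}")
```

An efficiency close to 1 means the added GPUs are almost fully used; a falling value indicates that communication, synchronization, or data loading is consuming a growing share of the run time.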