Profiling and optimization of multi-card GPU machine learning jobs

Marcin Lawenda, Kyrylo Khloponin, Krzesimir Samborski, Łukasz Szustak

TL;DR

This study tackles the optimization of multi-GPU ML workloads by systematically evaluating precision, memory management, data loading, and tensor layouts on a high-end NVIDIA GPU cluster. It combines image recognition profiling with LLM fine-tuning to show how techniques such as mixed precision ($FP32$/$FP16$), pinned memory, and NHWC data layouts affect throughput, memory transfers, and synchronization overhead. The results show substantial speedups from reducing precision down to $FP16$, modest gains from $pin\_memory$, and clear advantages for NHWC/DALI over the default DataLoader in NUMA environments, with DDP-DS further improving GPU utilization. For LLM tuning, LoRA generally yields faster iteration times than DPO, while QLoRA and QAT trade speed for memory efficiency and quantization benefits, which highlights the trade-offs that matter for practical deployment on multi-GPU systems. Overall, the work provides actionable guidance for optimizing multi-GPU ML workloads in heterogeneous memory and topology settings, emphasizing precision strategies, data-loading pipelines, and framework choices to maximize performance and scalability.

Abstract

The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing, are analyzed. Selected optimization strategies are studied in detail, highlighting the related challenges and advantages of their implementation. Furthermore, the impact of different performance improvement techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of large language models is investigated. Experimental results illustrate how the nature of the task affects the iteration time in a multiprocessor environment, VRAM utilization, and overall memory transfers. Test scenarios are evaluated on the modern NVIDIA H100 GPU architecture.
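
To make the knobs named above concrete, the following is a minimal sketch of how pinned memory, the NHWC layout, and mixed precision are typically switched on in a training loop. It assumes PyTorch, which the $pin\_memory$ and DataLoader terminology suggests but this summary does not confirm. The model, the random data, and the names `build_model` and `train_steps` are hypothetical placeholders, not the authors' code, and bfloat16 is used only as a CPU fallback so the sketch also runs without a GPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def build_model() -> nn.Sequential:
    # Tiny stand-in CNN (hypothetical); the paper's actual network is not given here.
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    )


def train_steps(image_size: int = 100, pin_memory: bool = True, use_nhwc: bool = True):
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    # Random placeholder data (assumption, not the paper's dataset).
    images = torch.randn(8, 3, image_size, image_size)
    labels = torch.randint(0, 10, (8,))
    loader = DataLoader(
        TensorDataset(images, labels),
        batch_size=4,
        # Page-locked host buffers only matter when copying to a GPU.
        pin_memory=pin_memory and use_cuda,
    )

    model = build_model().to(device)
    if use_nhwc:
        # channels_last is PyTorch's name for the NHWC memory layout.
        model = model.to(memory_format=torch.channels_last)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # FP16 autocast on GPU is paired with loss scaling; bfloat16 is the CPU fallback.
    amp_dtype = torch.float16 if use_cuda else torch.bfloat16
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    last_loss = float("nan")
    for batch_images, batch_labels in loader:
        # non_blocking copies can overlap with compute when the source is pinned.
        batch_images = batch_images.to(device, non_blocking=True)
        batch_labels = batch_labels.to(device, non_blocking=True)
        if use_nhwc:
            batch_images = batch_images.contiguous(memory_format=torch.channels_last)

        optimizer.zero_grad()
        with torch.autocast(device_type=device.type, dtype=amp_dtype):
            logits = model(batch_images)
        # Compute the loss in FP32 for numerical stability.
        loss = nn.functional.cross_entropy(logits.float(), batch_labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        last_loss = loss.item()

    return model, last_loss
```

Calling `train_steps(image_size=100, pin_memory=False)` and `train_steps(image_size=100, pin_memory=True)` under a profiler is one way to compare the two settings. The sketch deliberately leaves out multi-GPU wrapping and DALI, which depend on the cluster setup.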

Paper Structure

This paper contains 39 sections, 17 figures, 5 tables.

Figures (17)

  • Figure 1: NUMA architecture of PROXIMA cluster nodes
  • Figure 2: Execution time (left) and efficiency (right) graphs for image sizes from 100x100 to 500x500, FP64, $pin\_memory = false$
  • Figure 3: Diagram of four timing metrics from Nsight Systems, presented as percentages
  • Figure 4: Share of metrics in the calculation time for different test configurations: image size, number of GPUs, and $pin\_memory$ parameter
  • Figure 5: Loss curves for six tests with different precision (double, float, half) and $pin\_memory$ settings (TRUE, FALSE)
  • ...and 12 more figures