Joint Training on AMD and NVIDIA GPUs

Jon Hu; Thomas Jia; Jing Zhu; Zhendong Yu

TL;DR

The paper addresses the challenge of training large language models on heterogeneous clusters that mix AMD and NVIDIA GPUs. It introduces two approaches: CPU-Forwarding Communication, which serves as a compatibility baseline, and Device-Direct Communication, a high-performance path that enables direct cross-vendor GPU data transfers via GPUDirect RDMA and CPU-offloading P2P. Experiments on LLaMA-8B and Qwen2-7B show that Device-Direct achieves up to 98% of the throughput of an NVIDIA homogeneous setup while maintaining stability and correctness. The work demonstrates that, with appropriate pipeline-parallel partitioning and engineering, AMD-NVIDIA heterogeneous clusters can efficiently support large-scale pre-training and make effective use of diverse GPU resources.

Abstract

As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated communication back-end selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose a Device-Direct Communication approach that integrates a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system, while preserving training stability and correctness.
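
The "differentiated communication back-end selection across parallel groups" mentioned above can be pictured with a small sketch. It is illustrative only and is not taken from the paper: the backend labels (`nccl`, `rccl`, `gloo`), the `Rank` type, and the `select_backend` helper are all assumptions chosen for the example. The assumed rule is that a group whose ranks share one vendor uses that vendor's native collective library, while a group spanning both vendors falls back to a CPU-forwarding path.

```python
from dataclasses import dataclass

# Hypothetical backend labels; the source does not name the libraries it uses.
VENDOR_NATIVE_BACKEND = {"nvidia": "nccl", "amd": "rccl"}
CROSS_VENDOR_BACKEND = "gloo"  # assumed stand-in for a CPU-forwarding path


@dataclass(frozen=True)
class Rank:
    rank_id: int
    vendor: str  # "nvidia" or "amd"


def select_backend(group):
    """Return a backend label for one parallel group (a list of Rank)."""
    vendors = {r.vendor for r in group}
    if len(vendors) == 1:
        # Single-vendor group: use that vendor's native collective library.
        return VENDOR_NATIVE_BACKEND[next(iter(vendors))]
    # Mixed-vendor group: route traffic through the CPU-forwarding backend.
    return CROSS_VENDOR_BACKEND


if __name__ == "__main__":
    nvidia = [Rank(i, "nvidia") for i in range(4)]
    amd = [Rank(i, "amd") for i in range(4, 8)]
    groups = {
        "tensor_parallel_nvidia": nvidia,
        "tensor_parallel_amd": amd,
        "pipeline_parallel": [nvidia[0], amd[0]],
    }
    for name, group in groups.items():
        print(name, select_backend(group))
```

In this sketch only the pipeline-parallel group crosses the vendor boundary, which matches the outline's note that heterogeneity is limited to pipeline parallelism. How the authors actually map groups to back ends may differ.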

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 1 table.

Introduction
Approach
CPU-Forwarding Communication
Device-Direct Communication
Results and Analysis
Experimental Setup
Testbed
Workloads and Parallelization
Performance
Stability
Correctness
Discussion
Heterogeneity Limited to Pipeline Parallelism
Effect of Model Partitioning
Engineering Challenges
...and 1 more sections

Figures (5)

Figure 1: CPU-offloading P2P mechanism.
Figure 2: Heterogeneous mixed-training communication invocation workflow.
Figure 3: Average training throughput under different training configurations.
Figure 4: Throughput over 500 iterations of Llama-8B and Qwen2-7B.
Figure 5: Loss over 500 iterations of Llama-8B and Qwen2-7B.
