T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

TL;DR

T3 targets the serialized all-reduce on the critical path of tensor-parallel Transformer workloads with a hardware-software co-design that transparently overlaps communication with the producer computation. It combines a lightweight track-and-trigger mechanism, pre-programmed DMA transfers, and near-memory compute to fuse the producer with the subsequent collective, which limits contention for GPU compute units and reduces DRAM traffic. For Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%), and the benefits persist at scale (29% geomean for sublayers in ~500-billion-parameter models). The results apply to both training and inference, and suggest the approach may extend to other collectives and future hardware while avoiding invasive kernel changes.

Abstract

Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in ~500-billion-parameter models, PaLM and MT-NLG.

Paper Structure

This paper contains 45 sections, 20 figures, 3 tables.

Figures (20)

  • Figure 1: T3 overview.
  • Figure 2: (a) Transformer, (b) fully-connected (FC) layer, (c) tensor-sliced FC layer with all-reduce on the critical path.
  • Figure 3: Ring implementation of reduce-scatter collective.
  • Figure 4: Transformer time spent on reduce-scatter (RS) and all-gather (AG) collectives as well as GEMMs which require collectives.
  • Figure 5: GEMM (left) when sliced in the dot-product dimension (right) still generates the same number of data blocks.
  • ...and 15 more figures
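
The ring reduce-scatter collective named in the Figure 3 caption can be illustrated with a small simulation. This is a minimal sketch of the textbook ring algorithm, not the paper's implementation or T3's hardware mechanism: the function name `ring_reduce_scatter`, the list-of-lists buffer layout, and the choice that device d starts by sending chunk d are all assumptions made for illustration.

```python
def ring_reduce_scatter(buffers):
    """Simulate a ring reduce-scatter over n devices (illustrative only).

    buffers[d][c] is device d's local copy of chunk c (a list of numbers).
    Returns a list whose entry d is (chunk_index, reduced_chunk): the one
    fully reduced chunk that device d owns after n - 1 steps.
    """
    n = len(buffers)
    # Work on copies so the caller's buffers are left untouched.
    data = [[list(chunk) for chunk in device] for device in buffers]

    for step in range(n - 1):
        # All devices send "simultaneously": snapshot the payloads first.
        sends = []
        for d in range(n):
            c = (d - step) % n  # chunk that device d forwards in this step
            sends.append((d, c, data[d][c]))

        # Each right-hand neighbour adds the received chunk to its own copy.
        for d, c, payload in sends:
            dst = (d + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Chunk c starts at device c and ends, fully reduced, at device c - 1,
    # so device d owns chunk (d + 1) % n.
    return [((d + 1) % n, data[d][(d + 1) % n]) for d in range(n)]
```

With four devices, calling `ring_reduce_scatter` on buffers where each device holds four chunks returns one fully summed chunk per device; following it with an all-gather of those chunks would complete an all-reduce. Each partial sum is forwarded only after the previous step's addition, which is the step-by-step dependency that makes the collective a candidate for fine-grained overlap with the producer.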