DynamiQ: Accelerating Gradient Synchronization using Compressed Multi-hop All-reduce

Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Ran Ben Basat

TL;DR

DynamiQ addresses the gradient synchronization bottleneck in multi-hop all-reduce for large-scale LLM training by introducing a grouped, non-uniform quantization framework with a decompress-accumulate-recompress fused kernel. It combines per-super-group statistics, variable bitwidth allocation, hierarchical scaling, and correlated rounding, implemented as fused CUDA kernels within a PyTorch DDP and NCCL-based back-end. Across ring and butterfly topologies and diverse workloads, it achieves up to $34.2\%$ faster time-to-accuracy while maintaining near BF16 final accuracy (e.g., $99.9\%$ of BF16) and demonstrates robustness under network contention and scaling up to dozens of workers. The work provides a practical, hardware-conscious solution for efficient gradient compression in multi-hop all-reduce, with open-source plans and broad applicability to LLM training at scale.

Abstract

Multi-hop all-reduce is the de facto backbone of large model training. As the training scale increases, the network often becomes a bottleneck, motivating reducing the volume of transmitted data. Accordingly, recent systems demonstrated significant acceleration of the training process using gradient quantization. However, these systems are not optimized for multi-hop aggregation, where entries are partially summed multiple times along their aggregation topology. This paper presents DynamiQ, a quantization framework that bridges the gap between quantization best practices and multi-hop aggregation. DynamiQ introduces novel techniques to better represent partial sums, co-designed with a decompress-accumulate-recompress fused kernel to facilitate fast execution. We extended PyTorch DDP to support DynamiQ over NCCL P2P, and across different LLMs, tasks, and scales, we demonstrate consistent improvement of up to 34.2% over the best among state-of-the-art methods such as Omni-Reduce, THC, and emerging standards such as MXFP4, MXFP6, and MXFP8. Further, DynamiQ is the only evaluated method that consistently reaches near-baseline accuracy (e.g., 99.9% of the BF16 baseline) and does so while significantly accelerating the training.
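The decompress-accumulate-recompress step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' fused CUDA kernel: it assumes plain uniform quantization with stochastic rounding and one max-based scale per group, whereas DynamiQ uses non-uniform quantization, hierarchical scaling, and correlated rounding. The function names (`compress`, `decompress`, `hop`, `chain_reduce`) are hypothetical and chosen only for this illustration.

```python
import math
import random

def compress(values, bits, rng):
    """Quantize one group to signed integer codes that share a single scale.

    Assumption: uniform levels with stochastic rounding (unbiased in expectation).
    """
    levels = (1 << (bits - 1)) - 1              # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / levels or 1.0   # all-zero group -> scale 1.0
    codes = []
    for v in values:
        x = v / scale
        low = math.floor(x)
        code = low + (1 if rng.random() < x - low else 0)  # round up w.p. frac(x)
        codes.append(max(-levels, min(levels, code)))      # guard float edge cases
    return codes, scale

def decompress(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]

def hop(received, local, bits, rng):
    """One intermediate hop: decompress the partial sum, add local data, recompress."""
    partial = decompress(*received)
    summed = [p + g for p, g in zip(partial, local)]
    return compress(summed, bits, rng)

def chain_reduce(worker_grads, bits, seed=0):
    """Aggregate one group along a chain of workers, recompressing at every hop."""
    rng = random.Random(seed)
    msg = compress(worker_grads[0], bits, rng)
    for local in worker_grads[1:]:
        msg = hop(msg, local, bits, rng)
    return decompress(*msg)
```

Because each hop re-quantizes a partial sum whose magnitude grows along the chain, the quantization error accumulates with the number of hops. This is the effect that, per the abstract, single-hop quantization schemes are not optimized for.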

Paper Structure

This paper contains 25 sections, 13 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: $\ell_2$ norm distributions of the gradients and their random shuffle for groups of size $16$ and super-groups of size $256$. The detailed experimental setups appear in Section \ref{sec:eval}.
  • Figure 2: The DynamiQ workflow: (a) workers first compute the metadata (mean and $\ell_2$ norm) for each of their super-groups; (b) a lightweight all-reduce call aggregates the metadata so that all workers know the global super-group means and the sum of $\ell_2$ norms; (c) based on the aggregated metadata, each worker normalizes each super-group by subtracting its global mean and reorders the super-groups by bit width, which is determined by the $\ell_2$ norms. In this example, SG3 has a lower bit width than SG2, so their places are swapped; (d) illustrates how the blue worker operates during the main all-reduce: it invokes the fused kernel to decompress the compressed partial sums received from the green worker, accumulate its local data, and recompress the result before sending it to the red worker; (e) after the main all-reduce terminates, all workers hold the same aggregated sums; (f) each worker adds back the global mean of each super-group and restores the original order to obtain the synchronized gradient.
  • Figure 3: The CDF of $F_j$, the squared $\ell_2$ norm of each super-group summed across workers. The vertical dashed lines are thresholds for our variable bitwidth allocation algorithms, where super-groups with larger $\ell_2$ norms are assigned more bits, using one of $2$, $4$, or $8$ bits.
  • Figure 4: Time-to-target perplexity and accuracy for training (fine-tuning) LLMs on an 8-GPU/4-worker testbed using ring all-reduce. We measure the time required relative to BF16 (lower is better) to reach specific convergence targets defined by BF16's final metrics (perplexities of $3.107$, $2.996$, $3.095$ and accuracy of $73.04\%$). For example, for BERT-large, "105%" means we measure the time it takes to reach a perplexity of $3.107 \cdot 1.05 \approx 3.26$, and for LLaMA 1B MMLU, "99%" means we measure the time it takes to reach an accuracy of $73.04 \cdot 0.99 \approx 72.3\%$. Bars are omitted for methods that do not reach the specified target.
  • Figure 5: Zoomed-in Time to Accuracy (TTA) curves for LLM training (fine-tuning) on an 8-GPU/4-worker testbed using ring all-reduce. Horizontal dashed lines indicate the final BF16 accuracy. As mentioned, MXFP4 and MXFP6 curves represent a best-case scenario based on upper-bound throughput estimation. DynamiQ is the only method to consistently converge faster than BF16 while roughly matching its perplexity and accuracy, followed by MXFP8. Although alternatives like THC and OR also show faster-than-baseline initial convergence (see Appendix \ref{fig: e2e-tta-full}), their performance ultimately stalls due to high compression error (see Appendix \ref{fig: vnmse-e2e}).
  • ...and 13 more figures
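
The metadata and bit-allocation steps named in the Figure 2 and Figure 3 captions can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the thresholds passed to `allocate_bits` are hypothetical (the captions do not say how they are chosen), the squared $\ell_2$ norm is used as the statistic following the Figure 3 caption, the ascending reordering direction is a guess based on the SG2/SG3 example in Figure 2, and the metadata all-reduce is simulated by a plain Python function.

```python
def supergroup_metadata(grad, sg_size):
    """Per-super-group (mean, squared l2 norm) for one worker's gradient."""
    meta = []
    for start in range(0, len(grad), sg_size):
        sg = grad[start:start + sg_size]
        meta.append((sum(sg) / len(sg), sum(v * v for v in sg)))
    return meta

def aggregate_metadata(per_worker_meta):
    """Stand-in for the lightweight metadata all-reduce.

    Returns the global mean of each super-group and F_j, the squared
    l2 norm of super-group j summed across workers.
    """
    n = len(per_worker_meta)
    num_sg = len(per_worker_meta[0])
    means = [sum(m[j][0] for m in per_worker_meta) / n for j in range(num_sg)]
    F = [sum(m[j][1] for m in per_worker_meta) for j in range(num_sg)]
    return means, F

def allocate_bits(F, low_thr, high_thr):
    """Assign 2, 4, or 8 bits per super-group; thresholds are hypothetical inputs."""
    return [2 if f < low_thr else 4 if f < high_thr else 8 for f in F]

def reorder_by_bits(bits):
    """Stable ordering of super-group indices so equal bit widths are contiguous.

    Assumption: ascending bit width; the source does not state the direction.
    """
    return sorted(range(len(bits)), key=lambda j: bits[j])

# Usage: two workers, three super-groups of size 2.
w0 = [1.0, 3.0, 0.1, -0.1, 4.0, 0.0]
w1 = [3.0, 1.0, 0.1, 0.1, 0.0, 1.0]
means, F = aggregate_metadata([supergroup_metadata(w, 2) for w in (w0, w1)])
bits = allocate_bits(F, low_thr=1.0, high_thr=18.0)
order = reorder_by_bits(bits)
```

Grouping super-groups of equal bit width contiguously lets each compressed segment use a single fixed-width layout, which is one plausible reason for the reordering step shown in Figure 2(c).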