QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu

TL;DR

QSync tackles the challenge of training DNNs on clusters with mixed training and inference GPUs by enabling synchronous data-parallel training through selective, per-operator quantization. It introduces a Predictor (comprising an Indicator and a Replayer) to estimate end-to-end latency and model perturbations from low-precision kernels, and an Allocator to greedily assign operator precisions under memory and throughput constraints, all supported by the LP-PyTorch backend. Empirical results show the predictor attains <5% throughput prediction error, with QSync delivering modest accuracy gains (roughly 0.27%–1.03%) and throughput improvements over uniform-precision baselines across CNN and transformer tasks on real hardware. The work demonstrates a viable path to high-utilization hybrid hardware for scalable, accurate, distributed training, reducing wasted idle inference-GPU capacity while preserving model quality.

Abstract

A number of production deep learning clusters have explored using inference hardware for DNN training during off-peak serving hours, when many inference GPUs sit idle. Conducting DNN training with a combination of heterogeneous training and inference GPUs, known as hybrid device training, presents considerable challenges due to disparities in compute capability and significant differences in memory capacity. We propose QSync, a training system that enables efficient synchronous data-parallel DNN training over hybrid devices by strategically exploiting quantized operators. According to each device's available resource capacity, QSync selects a quantization-minimized setting for operators in the distributed DNN training graph, minimizing model accuracy degradation while keeping the training efficiency brought by quantization. We carefully design a predictor with a bi-directional mixed-precision indicator to reflect the sensitivity of DNN layers to fixed-point and floating-point low-precision operators, a replayer with a neighborhood-aware cost mapper to accurately estimate the latency of distributed hybrid mixed-precision training, and an allocator that efficiently synchronizes workers with minimized model accuracy degradation. QSync bridges the computational graph on PyTorch to an optimized backend for quantization kernel performance and flexible support for various GPU architectures. Extensive experiments show that QSync's predictor can accurately simulate distributed mixed-precision training with <5% error, and that QSync achieves a consistent 0.27-1.03% accuracy improvement on from-scratch training tasks compared to uniform precision.
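
The allocation idea described above can be illustrated with a minimal sketch. This is not QSync's actual allocator: the `Op` record, the sensitivity scores, and the per-precision latency and memory figures are all hypothetical, and the greedy rule used here (start fully quantized, then restore the most sensitive operators to full precision for as long as the device stays within its latency and memory budgets) is only one plausible reading of "quantization-minimized". Summing per-operator latencies is also a simplification.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    sensitivity: float  # hypothetical score: harm to accuracy if quantized
    cost: dict          # hypothetical profile: precision -> (latency_ms, memory_mb)


def greedy_allocate(ops, latency_budget_ms, memory_budget_mb,
                    low="int8", high="fp32"):
    """Return {op name: precision}, keeping as many sensitive ops at `high` as fit."""
    # Start from the cheapest plan: every operator quantized.
    # (If even this plan exceeds the budgets, it is returned unchanged.)
    plan = {op.name: low for op in ops}

    def totals(candidate):
        latency = sum(op.cost[candidate[op.name]][0] for op in ops)
        memory = sum(op.cost[candidate[op.name]][1] for op in ops)
        return latency, memory

    # Visit the most sensitive operators first and try to de-quantize each one.
    for op in sorted(ops, key=lambda o: o.sensitivity, reverse=True):
        trial = dict(plan)
        trial[op.name] = high
        latency, memory = totals(trial)
        if latency <= latency_budget_ms and memory <= memory_budget_mb:
            plan = trial  # still within both budgets: keep full precision
    return plan


ops = [
    Op("conv1", 0.9, {"int8": (1.0, 10), "fp32": (2.0, 20)}),
    Op("conv2", 0.2, {"int8": (2.0, 30), "fp32": (5.0, 60)}),
    Op("fc", 0.5, {"int8": (0.5, 5), "fp32": (1.0, 10)}),
]
plan = greedy_allocate(ops, latency_budget_ms=6.0, memory_budget_mb=70)
print(plan)
```

With these made-up numbers, the two most sensitive operators fit at full precision, while restoring `conv2` would exceed the latency budget, so it stays quantized.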

Paper Structure

This paper contains 25 sections, 4 theorems, 27 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

With the loss function unchanged, by using an unbiased quantizer for linear operators, we have $\mathbb{E}[\nabla f_s(\mathbf{x}; \{b_{io} \mid o \in O\})] = \mathbb{E}[\nabla f_s^{(0)}(\mathbf{x})]$.
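
To make the role of the unbiased quantizer concrete, here is a minimal numerical sketch. It uses stochastic rounding, a standard unbiased quantizer, as a stand-in; the source does not say which quantizer QSync uses, and the function names, step size, and test values are all illustrative assumptions. Because a linear operator is linear in the quantized tensor, unbiasedness carries through to its output by linearity of expectation, which is the mechanism behind the stated equality of expected gradients.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Quantize x to a multiple of `step`.

    Rounds up with probability equal to the fractional position of x
    between the two neighboring grid points, so that E[q(x)] == x.
    """
    scaled = x / step
    low = math.floor(scaled)
    frac = scaled - low
    return (low + (1 if rng.random() < frac else 0)) * step


def linear(w, x):
    """A toy linear operator: the dot product of weights and inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


rng = random.Random(0)          # fixed seed so the run is reproducible
w = [0.5, -1.25, 2.0]           # illustrative weights
x = [0.33, 0.71, -0.18]         # illustrative full-precision inputs
step = 0.25                     # illustrative quantization step

exact = linear(w, x)
trials = 20000
mean_quantized = sum(
    linear(w, [stochastic_round(xi, step, rng) for xi in x])
    for _ in range(trials)
) / trials

print(f"exact output:            {exact:.4f}")
print(f"mean of quantized runs:  {mean_quantized:.4f}")
```

Each individual quantized run is noisy, but the average over many runs approaches the full-precision output. Deterministic round-to-nearest does not have this property in general.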

Figures (8)

  • Figure 1: Illustration of QSync. QSync recovers model quality by reducing the number of unnecessary quantized operators without sacrificing overall training efficiency.
  • Figure 2: Full and partial resource sharing. Left: A full-sharing GPU has no strict resource isolation, whereas a partial-sharing GPU has a strict resource reservation. Right: In training, the resources of a full-sharing inference GPU can be fully used by the training job. In contrast, under partial resource sharing only a portion of the resources is made available.
  • Figure 3: QSync Workflow
  • Figure 4: Cost Composition of an Operator
  • Figure 5: Workflow of Replayer. (1) The local precision DAG is updated upon a change in operator precision, and the cost mapper traverses the graph to update the precisions of dependent operators. (2) The casting costs in the new precision DAG are calculated, and the pure operator execution cost is retrieved from the profiling results. (3) The local data flow graph (DFG) is updated, and (4) the global DFG is updated accordingly, which can be used in the overall training throughput simulation.
  • ...and 3 more figures
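
Step (2) of the Figure 5 caption, which combines casting costs with profiled execution costs, can be sketched for the simplest case of a linear chain of operators. Everything here is a hypothetical illustration rather than the Replayer's real data structures: the operator names, the `exec_cost` profile table, the `cast_cost` table, and the assumption that a cast is paid exactly when two adjacent operators run at different precisions.

```python
def chain_latency(order, precision, exec_cost, cast_cost):
    """Estimate the latency of a chain of operators.

    order:     operator names in execution order
    precision: {op name: precision string}
    exec_cost: {(op name, precision): profiled execution time in ms}  (hypothetical)
    cast_cost: {(from precision, to precision): casting time in ms}   (hypothetical)
    """
    total = 0.0
    previous = None
    for name in order:
        current = precision[name]
        # A cast is charged only where neighboring operators disagree on precision.
        if previous is not None and previous != current:
            total += cast_cost[(previous, current)]
        total += exec_cost[(name, current)]
        previous = current
    return total


order = ["conv", "bn", "relu"]
exec_cost = {
    ("conv", "fp32"): 4.0, ("conv", "int8"): 2.0,
    ("bn", "fp32"): 1.0,
    ("relu", "fp32"): 0.5,
}
cast_cost = {("int8", "fp32"): 0.25, ("fp32", "int8"): 0.25}

all_fp32 = chain_latency(
    order, {"conv": "fp32", "bn": "fp32", "relu": "fp32"}, exec_cost, cast_cost)
mixed = chain_latency(
    order, {"conv": "int8", "bn": "fp32", "relu": "fp32"}, exec_cost, cast_cost)
print(all_fp32, mixed)
```

The point of the sketch is that an operator's effective cost depends on its neighbors: lowering the precision of `conv` saves execution time but adds a cast before `bn`, so the net gain is smaller than the profiled kernel speedup alone.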

Theorems & Definitions (9)

  • Proposition 1: Unbiased Gradient
  • Theorem 1
  • Proposition 2: Tensor Quantization Variance
  • Proposition 3: Variance Increment
  • proof
  • proof: Variance of the Fixed-Point Quantization
  • proof: Variance of Floating-Point Quantization
  • proof
  • proof