Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

Zhigang Wang, Xu Zhang, Ning Wang, Chuanfei Xu, Jie Nie, Zhiqiang Wei, Yu Gu, Ge Yu

TL;DR

This work tackles straggling in tensor-parallel training of large transformers on heterogeneous, multi-tenant hardware. It introduces ZERO-resizing, which dynamically prunes matrix dimensions and relies on imputation and a priority mechanism, and SEMI-migration, which combines resizing with lightweight data migration based on broadcast-reduce and local merging. A cost-aware allocation framework decides when to resize and when to migrate, yielding substantial runtime improvements with minimal accuracy loss, as demonstrated on ViT-scale models within Colossal-AI. The approach enables efficient, scalable training of billion-parameter models on more affordable heterogeneous clusters, making such training feasible for both academia and industry.

Abstract

Transformer-based models have recently grown deeper and larger. To scale training, a common industry solution is to split billions of parameters (tensors) into many tasks and run them across homogeneous accelerators (e.g., GPUs). However, such a dedicated compute cluster is prohibitively expensive for academia and mid-sized companies. An economical alternative is to aggregate existing heterogeneous devices and share resources among multiple tenants. Nevertheless, static hardware configurations and dynamic resource contention inevitably cause straggling tasks, which heavily slow down overall training. Existing works are mainly tailored to traditional data parallelism and do not work well for tensor parallelism because of its strict communication and correctness constraints. In this paper, we first present ZERO-resizing, a novel dynamic workload balancing technique that requires no data migration. We tune workloads in real time by temporarily resizing the matrices involved in core tensor-related computations. We design data imputation and priority selection policies to, respectively, satisfy the consistency constraint required by normal training and reduce the accuracy loss. We also give a lightweight data migration technique that incurs no accuracy loss, to cope with heavy heterogeneity. Our final SEMI-migration solution is built on top of these two techniques and adaptively decides which of them handles each balancing task, achieving both efficiency and accuracy. Extensive experiments on the representative Colossal-AI platform validate the effectiveness of our proposals.
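
To make the idea of resizing with imputation concrete, here is a minimal pure-Python sketch of what a straggling tensor-parallel task might do. It is an illustration under stated assumptions, not the paper's implementation. The function names, the choice to prune weight columns, the zero-fill imputation, and the default keep-the-leading-columns selection are all hypothetical; the abstract only says that matrices are temporarily resized, that imputation restores consistency, and that a priority policy limits accuracy loss.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def resized_matmul(x, w, gamma, keep=None):
    """Compute x @ w using only a (1 - gamma) fraction of w's columns.

    Assumption: skipped output columns are imputed with zeros so that the
    result keeps the full n x m shape expected by later communication steps.
    """
    m = len(w[0])
    n_keep = max(1, int((1 - gamma) * m))
    if keep is None:
        # Hypothetical placeholder policy: keep the leading columns.
        # A priority policy would pass its own column indices via `keep`.
        keep = list(range(n_keep))
    w_small = [[row[j] for j in keep] for row in w]   # pruned weight shard
    y_small = matmul(x, w_small)                      # cheaper computation
    y = [[0.0] * m for _ in x]                        # zero-imputed output
    for i, row in enumerate(y_small):
        for pos, j in enumerate(keep):
            y[i][j] = row[pos]
    return y


x = [[1.0, 2.0]]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
full = matmul(x, w)                      # every column computed
half = resized_matmul(x, w, gamma=0.5)   # half the columns skipped
```

With `gamma=0.5` the task multiplies against half of its weight columns, yet returns an output of the original shape, so the other tasks see no difference in tensor sizes. The skipped columns are the source of the accuracy loss that the priority selection policy is meant to limit.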

Paper Structure

This paper contains 21 sections, 3 equations, 11 figures, 1 table, 2 algorithms.

Figures (11)

  • Figure 1: Tensor parallelism in a FFN layer
  • Figure 2: Matrix pruning and imputation process ($\gamma=0.5$)
  • Figure 3: Impact of different imputation policies on the model accuracy (ACC)
  • Figure 4: Illustration of the sending-collecting migration
  • Figure 5: Overall performance in homogeneous environments (ViT-1B)
  • ...and 6 more figures