Whale: Efficient Giant Model Training over Heterogeneous GPUs

Xianyan Jia; Le Jiang; Ang Wang; Wencong Xiao; Ziji Shi; Jie Zhang; Xinyuan Li; Langshi Chen; Yong Li; Zhen Zheng; Xiaoyong Liu; Wei Lin

TL;DR

The paper addresses the difficulty of training giant models on heterogeneous GPUs by introducing Whale, a framework that unifies data parallelism, model parallelism, and their hybrids through two high-level primitives and a hardware-aware runtime. A parallel planner converts the local model graph into a distributed one: it generates VirtualDevices, partitions the model into TaskGraphs, and inserts bridge layers, while balancing the workload across GPUs of differing capability. Reported results include 91% scalability for M6-10B across multiple GPUs and, with M6-MoE, training of models with trillions of parameters after only minor code changes. Whale thus offers a practical path to efficient, adaptable giant-model training that combines ease of programming with system-level optimization in heterogeneous environments.

Abstract

Scaling up deep neural networks has been shown to improve model quality, but it also raises several challenges in training efficiency, programmability, and resource adaptability. We present Whale, a general and efficient distributed training framework for giant models. To support various parallel strategies and their hybrids, Whale generalizes the programming interface by defining two new primitives in the form of model annotations, which allow users to supply hints. The Whale runtime uses those annotations and performs graph optimizations to transform a local deep learning DAG for distributed multi-GPU execution. Whale further introduces a novel hardware-aware parallel strategy, which improves the performance of model training on heterogeneous GPUs in a balanced manner. Deployed in a production cluster with 512 GPUs, Whale successfully trains an industry-scale multimodal model with over ten trillion model parameters, named M6, demonstrating great scalability and efficiency.
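
To make the idea of balancing work across heterogeneous GPUs concrete, the following is a minimal sketch of one simple policy: splitting a global batch across data-parallel replicas in proportion to each GPU's relative throughput. It is not Whale's algorithm or API. The names `Gpu` and `balance_batch`, the `throughput` field, and the example numbers are all hypothetical, and the sketch ignores memory limits, which a real planner would also have to respect.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device description (not a Whale type)."""
    name: str          # assumed unique per device
    throughput: float  # assumed relative compute speed, any consistent unit


def balance_batch(global_batch: int, gpus: list[Gpu]) -> dict[str, int]:
    """Split global_batch across gpus in proportion to their throughput.

    Uses the largest-remainder method so the per-GPU shares are integers
    that sum exactly to global_batch.
    """
    if global_batch < 0:
        raise ValueError("global_batch must be non-negative")
    if not gpus:
        raise ValueError("at least one GPU is required")
    if any(g.throughput <= 0 for g in gpus):
        raise ValueError("throughput must be positive")

    total = sum(g.throughput for g in gpus)
    # Ideal (fractional) share for each GPU.
    exact = [global_batch * g.throughput / total for g in gpus]
    # Start from the rounded-down shares.
    shares = [int(x) for x in exact]
    leftover = global_batch - sum(shares)
    # Give the remaining samples to the GPUs with the largest fractional parts.
    by_remainder = sorted(
        range(len(gpus)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return {g.name: s for g, s in zip(gpus, shares)}


if __name__ == "__main__":
    # Hypothetical cluster: one GPU twice as fast as the other.
    cluster = [Gpu("fast", 2.0), Gpu("slow", 1.0)]
    print(balance_batch(96, cluster))
```

With a proportional split like this, the faster GPU processes a larger micro-batch, so both replicas would finish a step at roughly the same time instead of the faster one idling while it waits for the slower one.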
