OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, Jie Zhao

TL;DR

OneFlow addresses the limitations of existing deep learning (DL) frameworks in handling large models and diverse parallelism by redesigning distributed training around SBP (split, broadcast, partial-value) and an actor-model runtime. The compiler converts a logical graph and a hardware placement into a physical execution plan, using SBP signatures to describe how each tensor is distributed and boxing ops to route data between devices. The runtime orchestrates execution through message-passing actors with explicit resource dependencies and back-pressure. The paper demonstrates applicability across data preprocessing, data/model/pipeline parallelism, and optimizer sharding, and reports performance competitive with or superior to tailored systems such as Megatron-LM, HugeCTR, and ZeRO-DP baselines. The work offers a simpler, more flexible approach to large-scale distributed DL, with future directions including elastic scaling and auto placement.

Abstract

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.
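To make the runtime idea in the abstract concrete, here is a minimal sketch of back-pressure driven by an explicit resource quota. It is a single-threaded toy simulation, not OneFlow's implementation: the function name `run_pipeline` and the parameters `slots` and `consumer_period` are hypothetical, and the "actors" are reduced to two steps per tick. The producer may emit only while it owns a free output buffer, and it regains a buffer only when the consumer acknowledges a message.

```python
from collections import deque


def run_pipeline(num_items, slots, consumer_period):
    """Simulate a producer and a slower consumer sharing `slots` buffers.

    Returns (max_in_flight, stalls, consumed). All names are illustrative.
    """
    free = slots          # producer's free output buffers (its resource quota)
    inbox = deque()       # messages delivered to the consumer, not yet processed
    produced = consumed = 0
    max_in_flight = 0
    stalls = 0            # ticks on which the producer had work but no free buffer
    tick = 0
    while consumed < num_items:
        tick += 1
        # Producer step: act only if a free buffer is available.
        if produced < num_items:
            if free > 0:
                free -= 1
                inbox.append(produced)
                produced += 1
            else:
                stalls += 1  # back-pressure: the producer waits
        max_in_flight = max(max_in_flight, len(inbox))
        # Consumer step: it is slower, acting once every `consumer_period` ticks.
        if inbox and tick % consumer_period == 0:
            inbox.popleft()
            consumed += 1
            free += 1        # acknowledgement returns the buffer to the producer
    return max_in_flight, stalls, consumed
```

Because `free + len(inbox)` always equals `slots`, the number of in-flight messages can never exceed the quota, however slow the consumer is. Making the resource an explicit dependency in this way, rather than leaving it implicit in the scheduler, is the property the sketch is meant to illustrate.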

Paper Structure

This paper contains 25 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: A typical DL framework that translates the logical graph of a three-layer NN into a physical graph (or execution plan) on 4 interconnected devices.
  • Figure 2: An example in which the scheduler in existing frameworks may cause a deadlock.
  • Figure 3: Interaction between callback function and the scheduler.
  • Figure 4: Example of 4 SBP signatures to map a $2\times 2$ global tensor to two devices. Each block in the figure indicates an entry of a tensor.
  • Figure 5: Example showing data movement with a boxing op inserted, when translating a logical graph into a physical graph.
  • ...and 11 more figures
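
The setting of Figure 4 (a $2\times 2$ global tensor mapped to two devices) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not OneFlow's API: it assumes the four signatures are split along axis 0, split along axis 1, broadcast, and partial-sum; nested lists stand in for device-local tensors; and every function name below is hypothetical.

```python
# Toy illustration only: nested lists stand in for per-device tensors.
GLOBAL = [[1, 2], [3, 4]]  # hypothetical 2x2 global tensor


def to_split(t, axis):
    """Shard t across two devices along `axis` (0 = rows, 1 = columns)."""
    if axis == 0:
        return [[t[0][:]], [t[1][:]]]
    return [[[row[0]] for row in t], [[row[1]] for row in t]]


def to_broadcast(t):
    """Every device holds a full copy of t."""
    return [[row[:] for row in t] for _ in range(2)]


def to_partial_sum(t):
    """Each device holds a full-shape tensor; only their elementwise sum equals t."""
    first = [[x // 2 for x in row] for row in t]  # one arbitrary decomposition
    second = [[x - y for x, y in zip(r, f)] for r, f in zip(t, first)]
    return [first, second]


def from_split(shards, axis):
    """Reassemble the global tensor by concatenating along `axis`."""
    if axis == 0:
        return shards[0] + shards[1]
    return [a + b for a, b in zip(shards[0], shards[1])]


def from_partial_sum(shards):
    """Reassemble the global tensor by summing the per-device tensors."""
    return [[x + y for x, y in zip(a, b)] for a, b in zip(shards[0], shards[1])]
```

In this toy model, changing a tensor from one signature to another is the kind of data movement that Figure 5 attributes to a boxing op. For example, turning the partial-sum layout into the broadcast layout amounts to summing across devices and giving every device the result, which is what an all-reduce does.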