Towards a Standardized Representation for Deep Learning Collective Algorithms

Jinsun Yoo; William Won; Meghan Cowan; Nan Jiang; Benjamin Klenk; Srinivas Sridharan; Tushar Krishna

TL;DR

To address the fragmentation of collective algorithm representations in distributed ML, the paper proposes a standardized workflow based on Chakra ET that unifies workload and collective representations. It demonstrates a proof-of-concept by converting MSCCLang-generated algorithms into Chakra ET and running them in ASTRA-sim across multiple network topologies. The key contributions are: (i) a common Chakra ET representation for both workloads and collectives, (ii) an MSCCLang-to-Chakra ET converter, and (iii) an ASTRA-sim extension to execute algorithms described in Chakra ET. This standardization enables co-optimization of communication and computation and improves interoperability across upstream and downstream tools.

Abstract

The explosion of machine learning model size has led to model execution on distributed clusters at a very large scale. Many works have tried to optimize the process of producing collective algorithms and running collective communications, which act as a bottleneck to distributed machine learning. However, different works use their own collective algorithm representation, which hinders co-optimizing collective communication with the rest of the workload. The lack of a standardized collective algorithm representation has also hindered interoperability between collective algorithm producers and consumers. Additionally, tool-specific conversions and modifications have to be made for each pair of tools producing and consuming collective algorithms, which adds to the engineering effort. In this position paper, we propose a standardized workflow leveraging a common collective algorithm representation. Upstream producers and downstream consumers converge on a common representation format based on Chakra Execution Trace, a commonly used graph-based representation of distributed machine learning workloads. Such a common representation enables us to view collective communications at the same level as workload operations, decouple producer and consumer tools, enhance interoperability, and relieve the user from the burden of having to focus on downstream implementations. We provide a proof-of-concept of this standardized workflow by simulating collective algorithms generated by the MSCCLang domain-specific language through the ASTRA-sim distributed machine learning simulator using various network configurations.

Paper Structure (14 sections, 4 figures, 1 table)

This paper contains 14 sections, 4 figures, 1 table.

Introduction
Background
Chakra Execution Trace
Upstream Collective Algorithm Producers
Downstream Distributed Machine Learning Tools
Collective Algorithm Representation
Motivation: Needs for Standardization
Solution: Using Chakra Execution Trace
Collective Algorithm in Chakra ET
Methodology
Representing MSCCLang Output in Chakra ET
Updating ASTRA-sim to Run Algorithms in Chakra ET
Evaluation
Conclusion and Future Work

Figures (4)

Figure 1: The proposed standardized workflow using Chakra Execution Trace (Chakra ET) as a common representation for both the distributed ML workload and the collective algorithm. Downstream tools receive both workload and collective algorithms represented in a common Chakra ET format. Sample Chakra ETs of the workload (left) and of the algorithm of a single collective (right) are also shown.
Figure 2: Snippets of the Chakra ET used to represent the workload and Ring collective algorithm used in the evaluation.
Figure 3: Components of ASTRA-sim involved in the case study. We extended ASTRA-sim to inject collective algorithms represented in Chakra ET. The extension is marked with dashed squares.
Figure 4: The collective duration for a 1D Ring algorithm across different topologies of 64 NPUs.
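
To make the idea of a collective algorithm expressed as a dependency graph more concrete, the following is a minimal sketch of how the per-rank steps of a 1D Ring collective could be laid out as send and receive nodes with explicit dependencies. It does not use the actual Chakra ET schema or any Chakra API: the `Node` class, its field names, and the function `ring_all_gather_graph` are hypothetical, and ring all-gather is chosen only for illustration, since the text here does not state which collective was evaluated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node; NOT the real Chakra ET node format."""
    node_id: int
    kind: str          # "SEND" or "RECV"
    peer: int          # rank on the other end of the transfer
    chunk: int         # index of the data chunk being moved
    deps: list = field(default_factory=list)  # node_ids that must finish first


def ring_all_gather_graph(rank, num_ranks):
    """Build one rank's node list for a ring all-gather (illustrative only)."""
    nxt = (rank + 1) % num_ranks   # neighbor we send to
    prv = (rank - 1) % num_ranks   # neighbor we receive from
    nodes = []
    prev_recv_id = None
    for step in range(num_ranks - 1):
        # At each step, forward the chunk obtained in the previous step
        # (the rank's own chunk at step 0) and receive a new one.
        send_chunk = (rank - step) % num_ranks
        recv_chunk = (rank - step - 1) % num_ranks
        deps = [] if prev_recv_id is None else [prev_recv_id]

        send = Node(len(nodes), "SEND", nxt, send_chunk, list(deps))
        nodes.append(send)
        # Assumption: receives are also ordered one after another.
        recv = Node(len(nodes), "RECV", prv, recv_chunk, list(deps))
        nodes.append(recv)
        prev_recv_id = recv.node_id
    return nodes


if __name__ == "__main__":
    for node in ring_all_gather_graph(rank=0, num_ranks=4):
        print(node)
```

Because each send depends only on the receive that produced its chunk, a consumer of such a graph could schedule the nodes like any other workload operation, which is the property the standardized representation is meant to provide.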
