EinDecomp: Decomposition of Declaratively-Specified Machine Learning and Numerical Computations for Parallel Execution

Daniel Bourgeois; Zhimin Ding; Dimitrije Jankov; Jiehui Li; Mahmoud Sleem; Yuxin Tang; Jiawen Yao; Xinyu Yao; Chris Jermaine


TL;DR

We show that any computation specified in the Einstein summation notation can be re-written into an equivalent tensor-relational computation that facilitates intra-operator parallelism, and that this re-write generalizes existing notions of tensor parallelism such as "data parallel" and "model parallel."

Abstract

We consider the problem of automatically decomposing operations over tensors or arrays so that they can be executed in parallel on multiple devices. We address two closely linked questions. First, what programming abstraction should systems for tensor-based computing offer to enable such decompositions? Second, given that abstraction, how should such systems automatically decompose a tensor-based computation? We assert that tensor-based systems should offer a programming abstraction based on an extended Einstein summation notation, which is a fully declarative, mathematical specification for tensor computations. We show that any computation specified in the Einstein summation notation can be re-written into an equivalent tensor-relational computation, and that this re-write generalizes existing notions of tensor parallelism such as "data parallel" and "model parallel." We consider the algorithmic problem of optimally computing a tensor-relational decomposition of a graph of operations specified in our extended Einstein summation notation, and we experimentally show the value of the algorithm that we develop.


Paper Structure (22 sections, 27 equations, 11 figures)


Introduction
Paper Roadmap
EinSum Background and Examples
Re-Writing EinSum to TRA
Tensor Relations
The Tensor-Relational Algebra
EinSum As a Tensor-Relational Programming Language
Parallelism via the Partitioning Vector
Optimizing the Decomposition
Ensuring Enough Parallel Work
Costing A Decomposition
The EinDecomp Algorithm
Counting EinSum Partitionings
Dynamic Programming
Computing the Optimal Cost During DP
...and 7 more sections

Figures (11)

Figure 1: Four tensor-relational partitionings for $\textbf{Z}_{i,k} \leftarrow \sum \textbf{X}_{i,j} \times \textbf{Y}_{j,k}$. In each there are 16 kernel calls.
Figure 2: Dataflow graphs associated with the partitionings of Figure 1. For partitionings $\textbf{d} = [4, 1, 1, 4]$ and $\textbf{d} = [2, 1, 1, 8]$, there is only a join layer, as the joined dimensions are not partitioned. For $\textbf{d} = [2, 4, 4, 2]$ and $\textbf{d} = [2, 2, 2, 4]$ there is also an aggregation.
Figure 3: Modifying an EinGraph supplied by a programmer, to produce a TaskGraph, by adding bound vectors.
Figure 4: Modifying an EinGraph supplied by a programmer, to produce a TaskGraph, by adding bound vectors. This is done so as to minimize an upper bound on the communication required for the corresponding decomposition.
Figure 5: Progression of the EinDecomp dynamic programming algorithm via a topological sort. After step 1, the lookup table $M$ holds the lowest cost for producing each possible output partitioning of vertex 1. After step 2, $M$ holds the lowest costs for both vertices 1 and 2. In general, after step $n$, $M$ holds the lowest cost for producing each possible output partitioning of vertices 1 through $n$.
...and 6 more figures
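
The partitionings named in the captions of Figures 1 and 2 can be illustrated with a small block-wise matrix multiply. The sketch below is not the paper's implementation. It assumes that the partitioning vector for $\textbf{Z}_{i,k} \leftarrow \sum \textbf{X}_{i,j} \times \textbf{Y}_{j,k}$ reads $\textbf{d} = [d_i, d_j, d_j, d_k]$, one entry per index of each input, which is an interpretation of the captions. The helper names (`matmul`, `split`, `decomposed_matmul`) are hypothetical, and plain Python lists stand in for tensors.

```python
from itertools import product

def matmul(A, B):
    """Dense kernel: ordinary matrix multiply on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split(M, parts_r, parts_c):
    """Cut M into a parts_r x parts_c grid of equal blocks, keyed by (row, col)."""
    br, bc = len(M) // parts_r, len(M[0]) // parts_c
    return {
        (r, c): [row[c * bc:(c + 1) * bc] for row in M[r * br:(r + 1) * br]]
        for r, c in product(range(parts_r), range(parts_c))
    }

def decomposed_matmul(X, Y, d):
    """Evaluate Z[i,k] = sum_j X[i,j] * Y[j,k] block-wise.

    Assumption: d = [d_i, d_j, d_j, d_k] gives the number of parts per index.
    Returns the assembled result and the number of kernel calls made.
    """
    di, dj, dj_y, dk = d
    assert dj == dj_y, "both inputs must partition the joined index j identically"
    Xb, Yb = split(X, di, dj), split(Y, dj, dk)
    partial, kernel_calls = {}, 0
    for i, j, k in product(range(di), range(dj), range(dk)):
        # Join layer: pair the blocks that agree on j and run one kernel call.
        P = matmul(Xb[i, j], Yb[j, k])
        kernel_calls += 1
        # Aggregation layer: sum partial results that share the same (i, k).
        # When dj == 1 each (i, k) is seen once, so nothing is aggregated.
        if (i, k) in partial:
            partial[i, k] = [[a + b for a, b in zip(ra, rb)]
                             for ra, rb in zip(partial[i, k], P)]
        else:
            partial[i, k] = P
    # Stitch the output blocks back into one matrix.
    Z = []
    for i in range(di):
        for r in range(len(partial[i, 0])):
            Z.append([v for k in range(dk) for v in partial[i, k][r]])
    return Z, kernel_calls

# Small 8 x 8 inputs so that every partitioning below divides the shapes evenly.
X = [[r * 8 + c for c in range(8)] for r in range(8)]
Y = [[(r + 2 * c) % 5 for c in range(8)] for r in range(8)]
for d in ([4, 1, 1, 4], [2, 1, 1, 8], [2, 4, 4, 2], [2, 2, 2, 4]):
    Z, calls = decomposed_matmul(X, Y, d)
    print(d, calls, Z == matmul(X, Y))
```

Under this reading, the number of kernel calls is $d_i \cdot d_j \cdot d_k$, which equals 16 for all four vectors, in line with the caption of Figure 1. The aggregation step only does work when $d_j > 1$, which corresponds to the caption of Figure 2 noting that an aggregation appears only when the joined dimension is partitioned.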
