Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity

Zi Yu Xue; Yannan Nellie Wu; Joel S. Emer; Vivienne Sze

TL;DR

This paper proposes a speculative tensor tiling approach, called overbooking, that improves buffer utilization by taking advantage of the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. It also introduces a statistical approach, Swiftiles, that picks a tile size so that tiles usually fit within the buffer's capacity but can potentially overflow.

Abstract

Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but typically allocate tile size in a given buffer for the worst-case data occupancy. This severely limits the utilization of available memory resources and reduces data reuse. Other accelerators employ complex tiling during preprocessing or at runtime to determine the exact tile size based on its occupancy. This paper proposes a speculative tensor tiling approach, called overbooking, to improve buffer utilization by taking advantage of the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. To ensure correctness, we propose a low-overhead hardware mechanism, Tailors, that can tolerate data overflow by design while ensuring reasonable data reuse. We demonstrate that Tailors can be easily integrated into the memory hierarchy of an existing sparse tensor algebra accelerator. To ensure high buffer utilization with minimal tiling overhead, we introduce a statistical approach, Swiftiles, to pick a tile size so that tiles usually fit within the buffer's capacity, but can potentially overflow, i.e., it overbooks the buffers. Across a suite of 22 sparse tensor algebra workloads, we show that our proposed overbooking strategy introduces an average speedup of $52.7\times$ and $2.3\times$ and an average energy reduction of $22.5\times$ and $2.5\times$ over ExTensor without and with optimized tiling, respectively.
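The idea of choosing a tile size so that tiles usually, but not always, fit in the buffer can be illustrated with a small sketch. This is not the Swiftiles algorithm from the paper: it exhaustively counts the occupancy of every coordinate-space tile instead of estimating it statistically, and the function names `tile_occupancies` and `pick_overbooked_tile`, the `fit_fraction` parameter, and the example matrix are all assumptions made for illustration.

```python
import numpy as np

def tile_occupancies(rows, cols, shape, tile_rows, tile_cols):
    """Nonzero count of every coordinate-space tile, including empty tiles."""
    n_tile_rows = -(-shape[0] // tile_rows)  # ceiling division
    n_tile_cols = -(-shape[1] // tile_cols)
    tile_ids = (rows // tile_rows) * n_tile_cols + (cols // tile_cols)
    return np.bincount(tile_ids, minlength=n_tile_rows * n_tile_cols)

def pick_overbooked_tile(rows, cols, shape, capacity, fit_fraction, candidates):
    """Largest candidate square tile edge for which at least `fit_fraction`
    of the tiles hold no more than `capacity` nonzeros (hypothetical helper)."""
    best = None
    for edge in sorted(candidates):
        occ = tile_occupancies(rows, cols, shape, edge, edge)
        if np.mean(occ <= capacity) >= fit_fraction:
            best = edge
    return best

# Hypothetical 8x8 matrix: a dense 2x2 block in the corner plus a sparse diagonal.
rows = np.array([0, 0, 1, 1, 2, 3, 4, 5, 6, 7])
cols = np.array([0, 1, 0, 1, 2, 3, 4, 5, 6, 7])

worst_case = pick_overbooked_tile(rows, cols, (8, 8), 2, 1.0, [1, 2, 4, 8])
overbooked = pick_overbooked_tile(rows, cols, (8, 8), 2, 0.9, [1, 2, 4, 8])
print(worst_case, overbooked)
```

With `fit_fraction=1.0` the sketch sizes tiles for the worst-case occupancy, which the single dense block forces down to the smallest candidate. Relaxing the requirement to 90% of tiles permits a larger tile, at the cost of one tile that overflows the buffer and would need a mechanism such as Tailors to handle it.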

Paper Structure (33 sections, 3 equations, 13 figures, 2 tables)

Introduction
Background
Sparse Tensor Algebra
Tiling Sparse Tensors
Exploiting Sparsity with Coordinate-Space Tiling Requires Expensive Preprocessing
Position-Space Tiling Requires Expensive Runtime Operand Matching
Data Orchestration for Tiling
Hardware for Overbooking
General Concept
Explicit Decoupled Data Orchestration
Tailors
Realization of Tail Overbooking
Maintaining Support for Buffet Semantics
Example of Overbooking with Tailors
Overbooking Tiling Strategy
...and 18 more sections

Figures (13)

Figure 1: Occupancy distribution of tiles with a size of 51.4M. The tiles are obtained by partitioning tensors from SuiteSparse [kolodziej_suitesparse_2019]. The occupancy varies from tile to tile: the maximum tile occupancy is more than three orders of magnitude smaller than the tile size, and the 90% threshold tile occupancy is more than $15\times$ smaller than the maximum tile occupancy.
Figure 2: Tiled sparse matrix multiplication between sparse 2-dimensional tensors (i.e., matrices) $A$ and $B$, when tiling in (a) coordinate space and (b) position space for a buffer with a capacity of two for each operand. Each step shows the tiles operated on. Dotted yellow boxes indicate the tile in coordinate space. CST constructs $A$ and $B$ tiles with uniform shapes and thus does not require runtime operand matching. PST constructs $A$ and $B$ tiles of uniform occupancy, but the tiles can have different shapes. Thus, PST requires a costly runtime traversal of $B$ both to determine its tiling and to search for all possible matching operands given a tile from $A$.
Figure 3: Comparison of data management between Tailors and buffets when (a) a tile from the stationary operand $A$ overbooks the buffer and (b) a tile from the non-stationary operand $B$ overbooks the buffer. Nonzeros in each sparse tensor are shown with colour and the tiles needed for the computation are outlined by dotted yellow boxes. Each state describes the data residing in the buffer after the data in the buffer changes. Data is removed from the buffer when the buffer is full and an element not residing in the buffer is required for an operation. Arrows are used to indicate data movement. An arrow into the buffer indicates data being written into the buffer, while an arrow out of the buffer indicates data being removed from the buffer. While the buffet continuously cycles data in the buffer, the Tailor is able to reuse a portion of the data.
Figure 4: (Left) A typical accelerator memory hierarchy made up of global buffers, PE buffers, and compute in each PE. Each buffer is associated with an address generator (AGEN) which generates addresses for future fills. (Middle) Tailors-defined operations on the buffer. (Right) Where data can be freed from the buffer for a given operation. Overwriting fills only modify the tail of the buffer when the buffer is full, while shrinks can modify the entire buffer starting from the head, and fills can modify the buffer when it is not full.
Figure 5: Tailors management following an example sequence of consecutive operations caused by overbooking with a buffer that can hold four elements. The FIFO-managed region is configured to hold two elements. Red boxes indicate the FIFO-managed region of the buffer and arrows indicate data movement. Arrows into the buffer indicate data fills from the parent, while arrows out of the buffer indicate data sent to the child. The FIFO Offset (i.e., the difference between the FIFO Head and the index of the least recent data in the FIFO) and the Buffer Offset (i.e., the location in the buffer) used to index into the buffer are shown. We implement the FIFO-managed region as a rolling buffer with a head pointer but show it with a fixed head position for simplicity.
...and 8 more figures
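
The contrast drawn in the captions of Figures 3 and 5, between a buffer that continuously cycles its data and one that overwrites only a FIFO-managed tail, can be sketched as a toy model. The model below is an assumption-laden simplification and not the Tailors hardware: it ignores buffet semantics, head pointers, and offsets, assumes the tile's elements are unique and are read in the same order on every pass, and uses hypothetical function names. It counts only how many elements must be fetched from the parent.

```python
from collections import deque

def fetches_full_fifo(tile, capacity, passes):
    """Parent fetches when the whole buffer is one FIFO (oldest evicted when full)."""
    buf, resident, fetches = deque(), set(), 0
    for _ in range(passes):
        for x in tile:
            if x in resident:
                continue  # reuse: element is already in the buffer
            fetches += 1
            if len(buf) == capacity:
                resident.discard(buf.popleft())
            buf.append(x)
            resident.add(x)
    return fetches

def fetches_tail_fifo(tile, capacity, fifo_size, passes):
    """Parent fetches when only the last `fifo_size` slots are overwritten once
    the buffer is full; the remaining slots keep the start of the tile."""
    head, head_cap = set(), capacity - fifo_size
    tail, tail_set, fetches = deque(), set(), 0
    for _ in range(passes):
        for x in tile:
            if x in head or x in tail_set:
                continue  # reuse: element is already in the buffer
            fetches += 1
            if len(head) < head_cap:
                head.add(x)  # fills the non-FIFO region; never overwritten here
            else:
                if len(tail) == fifo_size:
                    tail_set.discard(tail.popleft())
                tail.append(x)
                tail_set.add(x)
    return fetches

# A six-element tile overbooks a four-element buffer with a two-element FIFO region.
tile = list(range(6))
print(fetches_full_fifo(tile, 4, 3), fetches_tail_fifo(tile, 4, 2, 3))
```

In this toy setting, the full-FIFO policy refetches every element on every pass, while the tail-FIFO policy keeps the first two elements resident and refetches only the rest. When the tile fits in the buffer, both policies fetch each element once.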
