FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

Gabriel Kulp, Andrew Ensinger, Lizhong Chen

TL;DR

The paper addresses the challenge of accelerating unstructured sparsity in high-order tensor contraction, a common operation in ML, in hardware. It presents FLAASH, a modular accelerator architecture built from Sparse Dot Product Engines (SDPEs), a job generator/dispatcher, tensor memory, and IO, which converts a contraction into dot-product tasks distributed across the engines. The architecture is implemented in Verilog and benchmarked against PyTorch and TensorFlow on synthetic and ML workloads, where FLAASH achieves over 25x speedup on a deep learning workload, with speedup growing as sparsity and tensor order increase. The work outlines a practical path to hardware support for high-order sparse tensors and identifies future optimizations in scheduling, caching, and dot-product decomposition.

Abstract

Tensors play a vital role in machine learning (ML) and often exhibit properties best explored while maintaining high-order. Efficiently performing ML computations requires taking advantage of sparsity, but generalized hardware support is challenging. This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction that achieves over 25x speedup for a deep learning workload. Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines (SDPEs). Memory structure and job distribution can be customized, and we demonstrate a simple approach as a proof of concept. We address the challenges associated with control flow to navigate data structures, high-order representation, and high-sparsity handling. The effectiveness of our approach is demonstrated through various evaluations, showcasing significant speedup as sparsity and order increase.
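
The decomposition described in the abstract, a contraction broken into many independent sparse dot products, can be illustrated in software. The sketch below is not the paper's hardware design: the coordinate-dictionary tensor format, the function names `sparse_dot` and `contract_last_mode`, and the two-pointer index merge are all assumptions chosen for illustration. Each call to `sparse_dot` stands in for one job that a dispatcher could hand to an SDPE.

```python
from itertools import product


def sparse_dot(a, b):
    """Dot product of two sparse fibers.

    Each fiber is a list of (index, value) pairs sorted by index
    (hypothetical format, not taken from the paper).
    """
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        ia, va = a[i]
        ib, vb = b[j]
        if ia == ib:        # indices match: multiply and accumulate
            total += va * vb
            i += 1
            j += 1
        elif ia < ib:       # advance whichever fiber is behind
            i += 1
        else:
            j += 1
    return total


def contract_last_mode(A, B):
    """Contract the last mode of A with the last mode of B.

    A and B map coordinate tuples to nonzero values. Every pair of
    fibers (one from A, one from B) forms one independent dot-product job.
    """
    def fibers(T):
        out = {}
        for coord in sorted(T):  # sorting keeps each fiber index-ordered
            out.setdefault(coord[:-1], []).append((coord[-1], T[coord]))
        return out

    result = {}
    for (pa, fa), (pb, fb) in product(fibers(A).items(), fibers(B).items()):
        value = sparse_dot(fa, fb)
        if value != 0.0:         # keep the output sparse
            result[pa + pb] = value
    return result


# Order-3 tensor (2 x 2 x 4) contracted with a matrix (2 x 4) over the last mode.
A = {(0, 0, 1): 2.0, (0, 0, 3): 1.0, (1, 1, 0): 4.0}
B = {(0, 1): 3.0, (1, 0): 5.0, (1, 3): 2.0}
C = contract_last_mode(A, B)
print(C)
```

Because the jobs share no state, they can run in any order or in parallel, which is the property that makes distributing them across many engines attractive. All-zero fibers never appear in the dictionaries, so they generate no jobs at all.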

Paper Structure

This paper contains 15 sections, 7 equations, 3 figures, 3 tables, 2 algorithms.

Figures (3)

  • Figure 1: FLAASH architecture overview in the center. Detail views below.
  • Figure 2: Simulations of Architecture Implemented in Verilog
  • Figure 3: Contraction Times (µs) vs Density (%) (a) $3 \times 3 \times 1024$ tensor contracted with a $3 \times 1024$ matrix to get an output tensor of $3 \times 3 \times 3$. (b) $7 \times 7 \times 512$ tensor contracted with a $7 \times 512$ matrix to get an output tensor of $7 \times 7 \times 7$. (c) $10 \times 10 \times 100$ tensor contracted with a $10 \times 100$ matrix to get an output tensor of $10 \times 10 \times 10$. The matrices involved in each contraction have 50% density.
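
The shapes in the Figure 3 caption imply that the last mode of the tensor is contracted with the last mode of the matrix. The short sketch below works through that arithmetic under this assumption. The helper name `contraction_jobs` is hypothetical, and counting one dot product per output element is a simplification, since the abstract notes that portions of a dot product may also be distributed.

```python
from math import prod


def contraction_jobs(tensor_shape, matrix_shape):
    """Return (output shape, dot-product count, dense dot-product length)
    for a contraction over the last mode of both operands."""
    if tensor_shape[-1] != matrix_shape[-1]:
        raise ValueError("contracted modes must have equal length")
    # Free modes of the tensor followed by free modes of the matrix.
    out_shape = tuple(tensor_shape[:-1]) + tuple(matrix_shape[:-1])
    return out_shape, prod(out_shape), tensor_shape[-1]


# Operand shapes from panels (a), (b), and (c) of Figure 3.
cases = [
    ((3, 3, 1024), (3, 1024)),
    ((7, 7, 512), (7, 512)),
    ((10, 10, 100), (10, 100)),
]
results = [contraction_jobs(t, m) for t, m in cases]
for (t, m), (shape, jobs, length) in zip(cases, results):
    print(f"{t} x {m} -> {shape}: {jobs} dot products of length {length}")
```

Moving from panel (a) to panel (c), the number of dot products grows from 27 to 1000 while the length of each one shrinks from 1024 to 100, so the three panels cover different balances between job count and job size.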