FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction
Gabriel Kulp, Andrew Ensinger, Lizhong Chen
TL;DR
The paper addresses the challenge of hardware-accelerating high-order tensor contractions with unstructured sparsity, a common operation in ML. It presents FLAASH, a modular accelerator architecture consisting of Sparse Dot Product Engines (SDPEs), a job generator/dispatcher, tensor memory, and IO, which converts contractions into distributed dot-product tasks. In a Verilog-based implementation benchmarked against PyTorch and TensorFlow on synthetic and ML workloads, FLAASH demonstrates substantial speedups (over 25x on a deep learning workload) and favorable scaling with the number of nonzeros (NNZ) and tensor order. The work highlights a practical path to hardware support for high-order sparse tensors and identifies avenues for future optimizations in scheduling, caching, and dot-product decomposition.
Abstract
Tensors play a vital role in machine learning (ML) and often exhibit properties best explored while maintaining their high-order structure. Efficiently performing ML computations requires taking advantage of sparsity, but generalized hardware support is challenging. This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction that achieves over 25x speedup for a deep learning workload. Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines (SDPEs). Memory structure and job distribution can be customized, and we demonstrate a simple approach as a proof of concept. We address the challenges associated with control flow to navigate data structures, high-order representation, and high-sparsity handling. The effectiveness of our approach is demonstrated through various evaluations, showcasing significant speedup as sparsity and order increase.
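To make the decomposition concrete, the following is a minimal software sketch of how a sparse high-order contraction can be broken into independent sparse dot products, one per pair of fibers. It is an illustration only, not the FLAASH hardware design: the coordinate-dictionary storage, the two-pointer index intersection, and the names `fibers`, `sparse_dot`, and `contract` are all assumptions introduced here. Each call to `sparse_dot` plays the role of one job that could, in principle, be handed to an SDPE.

```python
from collections import defaultdict
from itertools import product


def fibers(tensor, mode):
    """Group nonzeros of a {coordinate tuple: value} tensor into fibers along `mode`.

    Returns {free_indices: sorted list of (index_along_mode, value)}.
    (Hypothetical helper; the storage format is an assumption.)
    """
    grouped = defaultdict(list)
    for coord, value in tensor.items():
        free = coord[:mode] + coord[mode + 1:]
        grouped[free].append((coord[mode], value))
    for entries in grouped.values():
        entries.sort()
    return grouped


def sparse_dot(fiber_a, fiber_b):
    """Dot product of two sorted sparse fibers via two-pointer index matching."""
    i = j = 0
    acc = 0.0
    while i < len(fiber_a) and j < len(fiber_b):
        idx_a, val_a = fiber_a[i]
        idx_b, val_b = fiber_b[j]
        if idx_a == idx_b:
            acc += val_a * val_b
            i += 1
            j += 1
        elif idx_a < idx_b:
            i += 1
        else:
            j += 1
    return acc


def contract(a, mode_a, b, mode_b):
    """Contract mode `mode_a` of `a` with mode `mode_b` of `b`.

    Every (fiber of a, fiber of b) pair is an independent dot-product job.
    """
    result = {}
    for (free_a, fa), (free_b, fb) in product(fibers(a, mode_a).items(),
                                              fibers(b, mode_b).items()):
        value = sparse_dot(fa, fb)
        if value != 0.0:
            result[free_a + free_b] = value
    return result


# Order-3 tensor (2 x 2 x 3) contracted with a matrix (3 x 2) over the shared mode.
a = {(0, 0, 0): 1.0, (0, 0, 2): 2.0, (1, 1, 1): 3.0}
b = {(0, 0): 4.0, (2, 0): 5.0, (1, 1): 6.0}
c = contract(a, 2, b, 0)
print(c)
```

Because the jobs share no state, they can be dispatched in any order. This sketch enumerates every fiber pair, including pairs with no matching indices; it makes no claim about how the FLAASH job generator selects or splits jobs.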
