Table of Contents
Fetching ...

Multilayer Dataflow: Orchestrate Butterfly Sparsity to Accelerate Attention Computation

Haibin Wu, Wenming Li, Kai Yan, Zhihua Fan, Peiyang Wu, Yuqun Liu, Yanhuan Liu, Ziqing Qiang, Meng Wu, Kunming Liu, Xiaochun Ye, Dongrui Fan

TL;DR

This work proposes a hybrid butterfly-sparsity network to obtain better trade-offs between attention accuracy and performance and proposes a scalable multilayer dataflow method supported by coarse-grained streaming parallelism designs, to orchestrate the butterfly sparsity computation on the dataflow array.

Abstract

Recent neural networks (NNs) with self-attention exhibit competitiveness across different AI domains, but the essential attention mechanism brings massive computation and memory demands. To this end, various sparsity patterns are introduced to reduce the quadratic computation complexity, among which the structured butterfly sparsity has been proven efficient in computation reduction while maintaining model accuracy. However, its complicated data accessing pattern brings utilization degradation and makes parallelism hard to exploit in general block-oriented architecture like GPU. Since the reconfigurable dataflow architecture is known to have better data reusability and architectural flexibility in general NN-based acceleration, we want to apply it to the butterfly sparsity for acquiring better computational efficiency for attention workloads. We first propose a hybrid butterfly-sparsity network to obtain better trade-offs between attention accuracy and performance. Next, we propose a scalable multilayer dataflow method supported by coarse-grained streaming parallelism designs, to orchestrate the butterfly sparsity computation on the dataflow array. The experiments show that compared with Jetson Xavier NX, our design has a speedup of up to $14.34\times$ ($9.29\times$ on average) as well as $11.14\times$ energy efficiency advancement in attention workloads. In comparison with SOTA attention accelerators of the same peak performance, our dataflow architecture acquires $2.38\times$-$4.7\times$ efficiency improvement as well as $6.60\times$-$15.37\times$ energy reduction with butterfly sparsity optimization.

Multilayer Dataflow: Orchestrate Butterfly Sparsity to Accelerate Attention Computation

TL;DR

This work proposes a hybrid butterfly-sparsity network to obtain better trade-offs between attention accuracy and performance and proposes a scalable multilayer dataflow method supported by coarse-grained streaming parallelism designs, to orchestrate the butterfly sparsity computation on the dataflow array.

Abstract

Recent neural networks (NNs) with self-attention exhibit competitiveness across different AI domains, but the essential attention mechanism brings massive computation and memory demands. To this end, various sparsity patterns are introduced to reduce the quadratic computation complexity, among which the structured butterfly sparsity has been proven efficient in computation reduction while maintaining model accuracy. However, its complicated data accessing pattern brings utilization degradation and makes parallelism hard to exploit in general block-oriented architecture like GPU. Since the reconfigurable dataflow architecture is known to have better data reusability and architectural flexibility in general NN-based acceleration, we want to apply it to the butterfly sparsity for acquiring better computational efficiency for attention workloads. We first propose a hybrid butterfly-sparsity network to obtain better trade-offs between attention accuracy and performance. Next, we propose a scalable multilayer dataflow method supported by coarse-grained streaming parallelism designs, to orchestrate the butterfly sparsity computation on the dataflow array. The experiments show that compared with Jetson Xavier NX, our design has a speedup of up to ( on average) as well as energy efficiency advancement in attention workloads. In comparison with SOTA attention accelerators of the same peak performance, our dataflow architecture acquires - efficiency improvement as well as - energy reduction with butterfly sparsity optimization.

Paper Structure

This paper contains 24 sections, 4 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Transformer network with butterfly pattern simplification.
  • Figure 2: Profiling original dense-based attention and fft-based attention kernels of VIT and BERT on the GPU platform - Jetson Xavier NX.
  • Figure 3: Fundamental patterns in existing sparsity variants.
  • Figure 4: The consecutive weight matrices multiplication of butterfly sparsity in diverse stride patterns.
  • Figure 5: Butterfly sparsity computation expressed as the dataflow execution with data dependence in multiple layers.
  • ...and 12 more figures