MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu

TL;DR

MAS-Attention accelerates exact attention on resource-constrained edge devices by running heterogeneous MAC and VEC units in parallel through a semi-synchronous, pipelined stream processing architecture. It introduces a multi-tier tiling scheme and a proactive buffer overwrite strategy to balance compute and memory constraints, so that tiled MatMul and Softmax execute in parallel while data dependencies are preserved. An offline search (MCTS, GA, or grid search) identifies tiling factors that maximize hardware utilization and minimize I/O. Evaluations show up to 2.75× speedup and 54% lower energy consumption than FLAT on simulated edge hardware, up to 1.76× speedup on a real DaVinci NPU device, and a 29.4% end-to-end latency reduction on a Stable Diffusion UNet workload. These gains come without sacrificing accuracy; future work will extend support to training and broader hardware platforms.

Abstract

The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision, and other domains. The attention mechanism has become an essential component of foundation models due to its superb capability of capturing correlations in a sequence. However, attention incurs quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators, leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and tightly limited on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators that uses heterogeneous compute units, i.e., vector processing units and matrix processing units, in parallel. Our method schedules workloads onto these compute units in a multi-tiered tiling scheme, processing the tiled vector workloads and matrix workloads in attention as two streams while respecting workload dependencies. We search for tiling factors that maximize the parallelization of both compute units while accounting for I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based on open-source simulation frameworks show up to 2.75× speedup and a 54% reduction in energy consumption compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario. Further experiments on a real-world edge neural processing unit demonstrate a speedup of up to 1.76× for attention compared to FLAT, without affecting model output accuracy.
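To make the tiling idea concrete, the following is a minimal pure-Python sketch, not the paper's implementation. It shows that processing the query matrix in row tiles leaves exact attention unchanged, and it marks which stage of each tile would be matrix-unit work (MatMul) and which would be vector-unit work (softmax). All function names here are hypothetical. The stages run sequentially, so the sketch does not model the parallel two-stream schedule, the tiling-factor search, or the cache overwrite strategy. The conventional 1/sqrt(d) score scaling is an assumption added for completeness.

```python
import math

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(s):
    """Numerically stable row-wise softmax (the vector-unit workload)."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(val - m) for val in row]
        z = sum(e)
        out.append([val / z for val in e])
    return out

def scaled_scores(q_rows, k):
    """Q K^T scaled by 1/sqrt(d) (assumed conventional scaling)."""
    k_t = [list(col) for col in zip(*k)]
    scale = 1.0 / math.sqrt(len(k[0]))
    return [[val * scale for val in row] for row in matmul(q_rows, k_t)]

def attention(q, k, v):
    """Untiled reference: softmax(Q K^T / sqrt(d)) V."""
    return matmul(softmax_rows(scaled_scores(q, k)), v)

def tiled_attention(q, k, v, tile_rows):
    """Row-tiled exact attention; each tile runs MatMul -> softmax -> MatMul."""
    out = []
    for start in range(0, len(q), tile_rows):
        q_tile = q[start:start + tile_rows]
        scores = scaled_scores(q_tile, k)   # matrix-unit work
        probs = softmax_rows(scores)        # vector-unit work, needs scores
        out.extend(matmul(probs, v))        # matrix-unit work, needs probs
    return out
```

Because softmax is applied per row, each query tile depends only on its own scores. That independence is what would let a scheduler overlap the softmax of one tile with the MatMul of another, provided the per-tile order MatMul, softmax, MatMul is kept.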

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, 3 tables, 4 algorithms.

Figures (7)

  • Figure 1: Dataflow comparison between FLAT and MAS-Attention: FLAT executes tiled stages sequentially, while MAS-Attention performs MatMul and softmax operations semi-synchronously in parallel, maximizing compute utilization and significantly enhancing overall performance.
  • Figure 2: Selective Overwriting of $V$ Matrix to Halt MatMul Operation in MAS-Attention’s Memory Strategy.
  • Figure 3: Selective Overwriting of $K$ Matrix to Halt MatMul Operation in MAS-Attention’s Memory Strategy.
  • Figure 4: Simulated Edge Architecture Design
  • Figure 5: Normalized Execution Time Comparison Across Networks for Different Methods on Huawei MatePad Pro 13.2 with DaVinci DNN Accelerator
  • ...and 2 more figures