Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems

Pedro Palacios, Rafael Medina, Jean-Luc Rouas, Giovanni Ansaloni, David Atienza

TL;DR

The paper addresses efficient edge deployment of transformers by co-designing structured pruning with systolic array acceleration, an approach it calls Systolic Array Structured Pruning (SASP). It presents a cross-stack framework that integrates pruning and quantization, system-level simulation, and hardware synthesis to explore how tile-aligned pruning interacts with array dimensions. In automatic speech recognition (ASR) on LibriSpeech and machine translation (MT) on MuST-C, the study reports up to 44% system-wide speedups and 42% energy savings with only 1.4% word error rate (WER) degradation at 20% pruning in the FP32_INT8 configuration (FP32 activations, INT8 weights). It also shows that larger arrays yield diminishing returns because they offer fewer pruning opportunities and incur higher hardware costs. The findings indicate SASP's potential for edge AI: it offers tunable trade-offs between run-time and QoS under resource constraints, which can guide co-design choices for accelerators and pruning strategies.

Abstract

Efficient deployment of resource-intensive transformers on edge devices necessitates cross-stack optimization. We thus study the interrelation between structured pruning and systolic acceleration, matching the size of pruned blocks with the systolic array dimensions. In this setting, computations of pruned weight blocks can be skipped, reducing run-time and energy consumption, but potentially impacting quality of service (QoS). To evaluate the trade-offs between systolic array size and sparsity opportunities, we present a novel co-design framework that integrates algorithmic optimization, system simulation, and hardware design. Targeting speech recognition and machine translation using transformers as a case study, we analyze how configuration choices across the stack affect performance metrics. Results demonstrate that structured pruning on systems featuring systolic array acceleration can effectively increase performance while maintaining high QoS levels. Up to 44% system-wide speedups due to structured pruning and quantization were measured, with only 1.4% word error rate degradation on the standard LibriSpeech dataset.
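
The idea of matching pruned blocks to the array dimensions, so that whole tiles can be skipped, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tile size `t` standing in for the systolic array dimension, and the pruning criterion (zeroing the tiles with the smallest L1 norm) are all assumptions made for illustration.

```python
# Illustrative sketch only. Function names and the L1-norm pruning
# criterion are assumptions, not taken from the paper.

def tile_norms(W, t):
    """Map each (row_tile, col_tile) index of W to the L1 norm of that t x t tile."""
    rows, cols = len(W), len(W[0])
    norms = {}
    for bi in range(0, rows, t):
        for bj in range(0, cols, t):
            norms[(bi // t, bj // t)] = sum(
                abs(W[i][j])
                for i in range(bi, min(bi + t, rows))
                for j in range(bj, min(bj + t, cols))
            )
    return norms


def prune_tiles(W, t, rate):
    """Zero the fraction `rate` of tiles with the smallest L1 norm.

    Returns a pruned copy of W and the set of pruned tile indices.
    """
    norms = tile_norms(W, t)
    n_prune = int(len(norms) * rate)
    pruned = set(sorted(norms, key=norms.get)[:n_prune])
    Wp = [row[:] for row in W]
    for ti, tj in pruned:
        for i in range(ti * t, min((ti + 1) * t, len(W))):
            for j in range(tj * t, min((tj + 1) * t, len(W[0]))):
                Wp[i][j] = 0.0
    return Wp, pruned


def tiled_matmul(X, W, t, pruned):
    """Compute X @ W one weight tile at a time, skipping pruned tiles.

    Returns the result and the number of weight tiles actually processed.
    """
    m, k, n = len(X), len(W), len(W[0])
    Y = [[0.0] * n for _ in range(m)]
    executed = 0
    for bk in range(0, k, t):
        for bn in range(0, n, t):
            if (bk // t, bn // t) in pruned:
                continue  # pruned tile: no multiply-accumulate work issued
            executed += 1
            for i in range(m):
                for j in range(bn, min(bn + t, n)):
                    acc = 0.0
                    for p in range(bk, min(bk + t, k)):
                        acc += X[i][p] * W[p][j]
                    Y[i][j] += acc
    return Y, executed


# Toy example: a 4x4 weight matrix split into four 2x2 tiles, 25% pruning.
W = [
    [1.0, 2.0, 0.1, 0.0],
    [3.0, 4.0, 0.0, 0.1],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 1.0, 2.0, 3.0],
]
X = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

Wp, pruned = prune_tiles(W, t=2, rate=0.25)
Y, executed = tiled_matmul(X, Wp, t=2, pruned=pruned)
```

In this toy case one of the four tiles is pruned, so only three tile multiplications are issued. A larger `t` leaves fewer, coarser tiles to choose from, which is one way to picture the reduced pruning opportunities on larger arrays.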

Paper Structure

This paper contains 14 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Qualitative radar plot illustrating two SASP solutions with different trade-offs: a slow and accurate one (red) employing a small accelerator and a low pruning rate, and a fast but inaccurate one (blue) using a large accelerator and a high pruning rate. Across all axes, higher is better.
  • Figure 2: Overview of the hardware-software co-design framework.
  • Figure 3: Tiled matrix multiplication with structured pruning.
  • Figure 4: Architectural diagram of the systolic array, supporting FP32 activations and either non-quantized (FP32) or quantized (INT8) weights.
  • Figure 5: Hardware diagram of the hybrid FP32_INT8 multiplier. This logic is bypassed in case any of the operands is equal to zero.
  • ...and 6 more figures