ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array

Meng Han, Liang Wang, Limin Xiao, Tianhao Cai, Zeyu Wang, Xiangrong Xu, Chenhao Zhang

TL;DR

ReDas addresses the low PE utilization of fixed systolic arrays with a lightweight, flexible accelerator that supports fine-grained reshaping and multiple dataflows through reconfigurable roundabout data paths and four multi-mode buffers surrounding the array. A dedicated mapping engine, built on an analytical model and interval sampling, selects a configuration for each layer, yielding about $4.6\times$ speedup and $8.3\times$ energy-delay product (EDP) reduction over a conventional systolic array, with area and power that compare favorably against prior flexible designs. The design sustains high PE utilization across diverse DNN workloads; it is evaluated on eight benchmarks and compared with TPUv2, Gemmini, Planaria, DyNNamic, and SARA. Overall, it balances flexibility, latency, and energy at moderate hardware overhead.

Abstract

The systolic accelerator is one of the premier architectural choices for DNN acceleration. However, the conventional systolic architecture suffers from low PE utilization due to the mismatch between the fixed array and diverse DNN workloads. Recent studies have proposed flexible systolic array architectures to adapt to DNN models, but these designs either support only coarse-grained reshaping or significantly increase hardware overhead. In this study, we propose ReDas, a flexible and lightweight systolic array that supports dynamic fine-grained reshaping and multiple dataflows. First, ReDas integrates lightweight and reconfigurable roundabout data paths, which achieve fine-grained reshaping using only short connections between adjacent PEs. Second, we redesign the PE microarchitecture and integrate a set of multi-mode data buffers around the array. The PE structure enables additional data bypassing and flexible data switching, while the multi-mode buffers facilitate fine-grained reallocation of on-chip memory resources, adapting to various dataflow requirements. ReDas can dynamically reconfigure to up to 129 different logical shapes and 3 dataflows for a $128\times128$ array. Finally, we propose an efficient mapper to generate appropriate configurations for each layer of DNN workloads. Compared to the conventional systolic array, ReDas can achieve about $4.6\times$ speedup and $8.3\times$ energy-delay product (EDP) reduction.
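
The two headline ratios are linked by the definition of EDP as energy multiplied by delay. As a rough consistency check only, and assuming both figures refer to the same workloads and the same aggregation (the abstract does not state this), the implied energy reduction can be worked out as follows. The subscripts "base" and "ReDas" are labels introduced here for the conventional array and ReDas.

```latex
\mathrm{EDP} = E \cdot T,
\qquad
\frac{\mathrm{EDP}_{\text{base}}}{\mathrm{EDP}_{\text{ReDas}}}
  = \frac{E_{\text{base}}}{E_{\text{ReDas}}} \cdot \frac{T_{\text{base}}}{T_{\text{ReDas}}}
\quad\Longrightarrow\quad
\frac{E_{\text{base}}}{E_{\text{ReDas}}}
  \approx \frac{8.3}{4.6} \approx 1.8
```

Under that assumption, most of the EDP gain would come from the shorter execution time, with a smaller contribution from lower energy.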

Paper Structure

This paper contains 28 sections, 5 equations, 22 figures, and 5 tables.

Figures (22)

  • Figure 1: Illustration of the execution process in the systolic array with different dataflows.
  • Figure 2: The ideal dataflow and physical shape of the systolic array vary across the layers of DNN models. The total number of PEs is at most $2^{12}$ or $2^{14}$.
  • Figure 3: Normalized execution time under different scenarios. Fixed: a $128\times128$ fixed PE array with WS dataflow. Ideal dataflow: the dataflow (WS/OS/IS) of the PE array is assumed to adapt optimally to the DNN model layer by layer, with the shape fixed at $128\times128$. Ideal shape: the shape of the PE array is assumed to adapt optimally to the DNN model layer by layer, with at most $128\times128$ PEs in total and the dataflow fixed as WS. Ideal shape & dataflow: both the shape and the dataflow of the PE array are assumed to adapt optimally to the DNN model layer by layer.
  • Figure 4: The area and leakage power of a 1 MB multi-ported buffer under varying aggregate buffer bandwidth, assuming a clock frequency of 1 GHz.
  • Figure 5: The overall architecture of ReDas.
  • ...and 17 more figures
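
The "ideal shape" scenario in the Figure 3 caption can be illustrated with a small sketch. It is not the paper's analytical model or mapper. It assumes a weight-stationary mapping in which a K×N weight matrix is tiled onto a rows×cols array, uses the number of weight tiles ("folds") as a crude proxy for execution time, and searches only power-of-two shapes that use exactly the full PE budget. All function names and the example layer size are hypothetical, and the shape set here is unrelated to the 129 logical shapes that ReDas supports.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def folds(k: int, n: int, rows: int, cols: int) -> int:
    """Number of weight tiles needed to cover a K x N matrix (assumed WS mapping)."""
    return ceil_div(k, rows) * ceil_div(n, cols)


def spatial_utilization(k: int, n: int, rows: int, cols: int) -> float:
    """Fraction of PEs holding useful weights, averaged over all folds."""
    return (k * n) / (folds(k, n, rows, cols) * rows * cols)


def power_of_two_shapes(total_pes: int) -> list[tuple[int, int]]:
    """All (rows, cols) with power-of-two sides and rows * cols == total_pes.

    Assumes total_pes is itself a power of two.
    """
    log2_total = total_pes.bit_length() - 1
    return [(1 << i, 1 << (log2_total - i)) for i in range(log2_total + 1)]


def best_shape(k: int, n: int, total_pes: int) -> tuple[int, int]:
    """Shape with the fewest folds; ties go to the first shape enumerated."""
    return min(power_of_two_shapes(total_pes), key=lambda s: folds(k, n, *s))


if __name__ == "__main__":
    k, n = 1024, 16  # hypothetical tall, narrow weight matrix
    print("fixed 128x128 utilization:", spatial_utilization(k, n, 128, 128))
    shape = best_shape(k, n, 2**14)
    print("best shape:", shape, "utilization:", spatial_utilization(k, n, *shape))
```

For this hypothetical layer, the fixed 128×128 array needs eight folds with most columns idle, whereas a 1024×16 logical shape covers the weights in a single fold. A realistic model would also account for pipeline fill and drain, the streamed dimension, and buffer bandwidth, none of which this sketch captures.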