FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

Jianming Tong, Anirudh Itagi, Prasanth Chatarasi, Tushar Krishna

TL;DR

The paper addresses the gap between theory and practice in exploiting per-layer dataflows for ML inference by highlighting the critical role of on-chip data layout and bank conflicts. It introduces FEATHER, a reconfigurable accelerator featuring NEST for flexible dataflows and BIRRD for reordering-in-reduction (RIR), enabling seamless dataflow-layout co-switching with negligible overhead. A Layoutloop tool extends Timeloop to model physical storage and layout-aware dataflow search, enabling per-layer optimization that minimizes energy-delay product. End-to-end FPGA deployment and Layoutloop-based evaluations show FEATHER achieving substantial latency and energy improvements over SoTA accelerators (e.g., up to 3.91x throughput versus Gemmini and up to 4.91x vs Edge TPU) with only about 6% extra area, while enabling flexible per-layer optimization across diverse models such as ResNet-50 and BERT. Collectively, FEATHER demonstrates practical, scalable gains by tightly integrating dataflow flexibility with on-chip data layout reordering, guided by Layoutloop for effective design-space exploration.

Abstract

The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. For systematically evaluating the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA and Eyeriss under ResNet-50 and MobileNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only 6% area overhead over a fixed-dataflow Eyeriss-like accelerator. Our code is released at https://github.com/maeri-project/FEATHER.

Paper Structure (64 sections, 14 figures, 5 tables, 1 algorithm)

This paper contains 64 sections, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Terminology of convolution workload and dataflow
  • Figure 2: Latency evaluation of dataflows on a $16\times 16$ PE array with various layouts (error bars show the impact of layout; lower latency is better). The best flexible dataflow (green bar) theoretically reduces the overall latency of a fixed dataflow-layout (blue bar) by $63.3$%. However, ignoring layout in theoretical dataflows results in up to a $128\times$ latency gap in practice (yellow bar). FEATHER eliminates the gap by co-switching dataflow and layout (red bar).
  • Figure 3: Layout terminology example: 'CHW_W4H2C2'. 'CHW' signifies the inter-line dimension order as C$\rightarrow$H$\rightarrow$W across lines. 'W4H2C2' indicates the intra-line dimension order: (4,2,2) elements from the (W,H,C) dimensions are flattened into a single row in the order of W$\rightarrow$H$\rightarrow$C.
  • Figure 4: Memory efficiency and computation utilization of various (workload, dataflow, data layout) combinations on a weight-stationary $4\times4$ Systolic Array (SA). Dataflows: input channel-parallel (D1) and sliding-window parallel (D2). Every cycle, D1 and D2 concurrently read at most four iActs from the on-chip buffer along the C and W dimensions, respectively. The digit in each iAct indicates the cycle in which that iAct is read. Workloads: (1) ResNet-50 layer 1, with a large height and width, and (2) ResNet-50 layer 47, with a large channel count. Layouts: channel-last layout (L1, L3) and row-major layout (L2, L4). In the channel-last layout, data from different input channels (dimension $C$) are spread across a single line, while in the row-major layout, data at different input width positions (dimension $W$) are flattened into a line. The performance of mappings (M1$\sim$M8) for the different (workload, dataflow, layout) combinations is analyzed in the tables. In each table, "iActs Required by Mapping" lists all iActs that need to be read concurrently from the on-chip buffer every cycle, and the indices (#) of the lines being accessed are listed in "Line # being Accessed". We assume dual read ports (because TSMC offers SRAM with at most two ports), such that a concurrent read of more than two lines leads to a slowdown, which reduces "Theoretical Computation Utilization" (estimated as the mapping efficiency over the array) to "Practical Compute Utilization" (computed as the theoretical utilization multiplied by the slowdown). Takeaway: For optimal performance, co-switching (dataflow, layout) is crucial, because dataflow matters (comparing M1 vs. M4), and layout also matters (comparing M2 vs. M4).
  • Figure 5: Overview of reordering patterns. The 2D layout without any reordering is shown in Fig. \ref{fig:typical_2d_buffer}, which only allows reading two rows concurrently, assuming true dual-port SRAM. Line Rotation (Fig. \ref{fig:line_rotation}, e.g., Medusa) moves a row from bank 0 to bank 1 prior to reading, enabling simultaneous access to at most three rows from bank 0 through dual-bank ports. This technique, however, uses an additional port of bank 1, potentially limiting access to other data in bank 1. Transpose (Fig. \ref{fig:transpose}, e.g., MTIA and TPUv4i) swaps rows with columns. Row Reorder (Fig. \ref{fig:row_reorder}, e.g., TPUv4i) permutes data within each row. Arbitrary Reorder (Fig. \ref{fig:arbitrary_reorder}, proposed in this work) enables arbitrary permutation of data within the entire 2D buffer. In prior works, Line Rotation, Transpose, and Row Reorder are performed by reading at most two rows per bank, using a Transpose/Permute unit to reorder the data, and then writing the data back in concordant order (on-chip RAR in Fig. \ref{fig:on_chip_reorder}). In contrast, FEATHER's BIRRD network (§\ref{sec:afft}) performs Arbitrary Reorder during the reduction phase of the matrix multiplication or convolution computation (RIR in Fig. \ref{fig:reorder_in_reduction}). The concordant dataflow space supported by each layout reorder pattern is shown in Fig. \ref{fig:reorder_cmp}. Reordering enables a given layout to alter the order of the data it can provide per cycle and across cycles. Among the four dimensions (T,O,P,S) of the concordant dataflow space, reordering enlarges O, P, and S by allowing dataflows to read from or write to the layout in a different order. Note that reordering by itself cannot enlarge flexibility in the T dimension, because higher Tiles flexibility requires accessing more data per cycle.
  • ...and 9 more figures
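
The layout naming in the Figure 3 caption and the dual-port read limit in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not taken from the paper or from Layoutloop. The helper names (`parse_layout`, `locate`, `read_slowdown`), the tensor shape, and the `CHW_W8` layout string are hypothetical. The flattening orders are assumptions (first-listed intra-line dimension varies fastest, first-listed inter-line dimension is outermost), and the slowdown model (one read cycle per pair of distinct lines) is a simplified reading of "a concurrent read of more than two lines leads to a slowdown".

```python
from math import ceil


def parse_layout(name):
    """Split e.g. 'CHW_W4H2C2' into the inter-line order and intra-line tile sizes."""
    inter, intra = name.split("_")
    tile = {}
    i = 0
    while i < len(intra):
        j = i + 1
        while j < len(intra) and intra[j].isdigit():
            j += 1
        tile[intra[i]] = int(intra[i + 1:j])  # dimension letter -> elements per line
        i = j
    return list(inter), tile


def locate(coord, shape, inter, tile):
    """Map a tensor coordinate {dim: index} to (line index, offset within the line)."""
    # Assumption: the first-listed intra-line dimension varies fastest.
    offset, stride = 0, 1
    for dim, size in tile.items():
        offset += (coord[dim] % size) * stride
        stride *= size
    # Assumption: the first-listed inter-line dimension is the outermost (slowest).
    line = 0
    for dim in inter:
        per_line = tile.get(dim, 1)
        line = line * ceil(shape[dim] / per_line) + coord[dim] // per_line
    return line, offset


def read_slowdown(coords, shape, layout, ports=2):
    """Count distinct lines touched by one concurrent read and the resulting slowdown."""
    inter, tile = parse_layout(layout)
    lines = {locate(c, shape, inter, tile)[0] for c in coords}
    cycles = ceil(len(lines) / ports)  # assumed model: `ports` lines served per cycle
    return len(lines), 1 / cycles


# Hypothetical 8x4x8 (C, H, W) tensor and a channel-parallel read of four iActs.
shape = {"C": 8, "H": 4, "W": 8}
channel_parallel = [{"C": c, "H": 0, "W": 0} for c in range(4)]

print(read_slowdown(channel_parallel, shape, "CHW_W4H2C2"))  # (2, 1.0): fits two ports
print(read_slowdown(channel_parallel, shape, "CHW_W8"))      # (4, 0.5): needs two cycles
```

Under this model, practical utilization would be the theoretical utilization multiplied by the returned slowdown factor. The count of distinct lines depends only on the intra-line tile sizes, so the two ordering assumptions affect the reported line indices and offsets but not the slowdown.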