FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
Jianming Tong, Anirudh Itagi, Prasanth Chatarasi, Tushar Krishna
TL;DR
The paper addresses the gap between theory and practice in exploiting per-layer dataflows for ML inference by highlighting the critical role of on-chip data layout and bank conflicts. It introduces FEATHER, a reconfigurable accelerator featuring NEST for flexible dataflows and BIRRD for reordering-in-reduction (RIR), enabling seamless dataflow-layout co-switching with negligible overhead. A Layoutloop tool extends Timeloop to model physical storage and layout-aware dataflow search, enabling per-layer optimization that minimizes energy-delay product. End-to-end FPGA deployment and Layoutloop-based evaluations show FEATHER achieving substantial latency and energy improvements over SoTA accelerators (e.g., up to 3.91x throughput versus Gemmini and up to 4.91x vs Edge TPU) with only about 6% extra area, while enabling flexible per-layer optimization across diverse models such as ResNet-50 and BERT. Collectively, FEATHER demonstrates practical, scalable gains by tightly integrating dataflow flexibility with on-chip data layout reordering, guided by Layoutloop for effective design-space exploration.
Abstract
The inference of ML models composed of diverse structures, types, and sizes boils down to the execution of different dataflows (i.e., different tiling, ordering, parallelism, and shapes). Using the optimal dataflow for every layer of a workload can reduce latency by up to two orders of magnitude over a suboptimal dataflow. Unfortunately, reconfiguring hardware for different dataflows involves on-chip data layout reordering and datapath reconfigurations, leading to non-trivial overhead that hinders ML accelerators from exploiting different dataflows, resulting in suboptimal performance. To address this challenge, we propose FEATHER, an innovative accelerator that leverages a novel spatial array termed Nest and a novel multi-stage reduction network called BIRRD for performing flexible data reduction with layout reordering under the hood, enabling seamless switching between optimal dataflows with negligible latency and resource overhead. To systematically evaluate the performance interaction between dataflows and layouts, we enhance Timeloop, a state-of-the-art dataflow cost modeling and search framework, with layout assessment capabilities, and term it Layoutloop. We model FEATHER in Layoutloop and also deploy FEATHER end-to-end on the edge ZCU104 FPGA. FEATHER delivers 1.27~2.89x inference latency speedup and 1.3~6.43x energy efficiency improvement compared to various SoTAs like NVDLA, SIGMA and Eyeriss under ResNet-50 and MobileNet-V3 in Layoutloop. On practical FPGA devices, FEATHER achieves 2.65/3.91x higher throughput than Xilinx DPU/Gemmini. Remarkably, such performance and energy efficiency enhancements come at only 6% additional area over a fixed-dataflow Eyeriss-like accelerator. Our code is released at https://github.com/maeri-project/FEATHER.
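
To make the interaction between dataflow and on-chip layout concrete, the following toy model counts the cycles needed to serve one parallel read from a banked buffer. It is only an illustrative sketch and is not taken from FEATHER or Layoutloop: the word-interleaved bank mapping (`addr % num_banks`), the one-word-per-bank-per-cycle assumption, the tensor sizes, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

def cycles_for_access(addrs, num_banks):
    """Cycles to serve a set of simultaneous reads, assuming word-interleaved
    banks (bank = addr % num_banks) that each deliver one word per cycle."""
    per_bank = Counter(a % num_banks for a in addrs)
    return max(per_bank.values())

# Two hypothetical layouts for a C x W tensor stored in a flat buffer.
def addr_c_inner(c, w, C, W):
    return w * C + c  # channel index varies fastest in memory

def addr_w_inner(c, w, C, W):
    return c * W + w  # width index varies fastest in memory

C, W = 8, 8          # assumed tensor shape
NUM_BANKS = 4        # assumed number of buffer banks
PAR = 4              # assumed dataflow: 4 channels read in parallel at fixed w

reads = [(c, 0) for c in range(PAR)]

c_inner_cycles = cycles_for_access(
    [addr_c_inner(c, w, C, W) for c, w in reads], NUM_BANKS)
w_inner_cycles = cycles_for_access(
    [addr_w_inner(c, w, C, W) for c, w in reads], NUM_BANKS)

print("channel-innermost layout:", c_inner_cycles, "cycle(s)")
print("width-innermost layout:  ", w_inner_cycles, "cycle(s)")
```

Under these assumptions, the same channel-parallel dataflow is conflict-free with the channel-innermost layout but serializes on a single bank with the width-innermost layout. This is the kind of mismatch that motivates switching the layout together with the dataflow.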
