DCP: Learning Accelerator Dataflow for Neural Network via Propagation
Peng Xu, Wenqi Shao, Mingyu Ding, Ping Luo
TL;DR
DCP addresses the onerous task of hand-crafting accelerator dataflows by learning a differentiable optimization pipeline that maps unified encodings of DNN layers and dataflows to hardware (HW) metrics. A neural predictor is trained on a large synthetic benchmark built from a 7D DNN-layer code $x$ and a $(3,7,2)$ dataflow code $y$, with metrics obtained from a MAESTRO-style simulator. After training, the predictor is held fixed and DCP updates the dataflow codes by gradient-based backpropagation to minimize or trade off latency, energy, and energy-delay product (EDP) at the layer, model, or multi-objective level, while enforcing HW constraints. Experiments on MobileNet-V2, ResNet-101, and ViT show that DCP outperforms fixed dataflows and prior search-based approaches, and that it generalizes to unseen HW settings in a zero-shot manner or with few-shot fine-tuning. The work enables rapid, hardware-aware dataflow customization, with practical impact for deploying efficient DNN accelerators on diverse hardware platforms.
Abstract
Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency. One key reason is the dataflow used to execute a DNN layer, including on-chip data partitioning, computation parallelism, and scheduling policy, which has a large impact on latency and energy consumption. Unlike prior works that require considerable effort from HW engineers to design suitable dataflows for different DNNs, this work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort. It has several attractive benefits that prior work lacks. (i) We translate the HW dataflow configuration into a code representation in a unified dataflow coding space, which can be optimized by backpropagating gradients given a DNN layer or network. (ii) DCP learns a neural predictor to efficiently update the dataflow codes along the desired gradient directions to minimize various optimization objectives, e.g., latency and energy. (iii) It can be easily generalized to unseen HW configurations in a zero-shot or few-shot learning manner. For example, without using additional training data, DCP surpasses the GAMMA method, which performs a full search using thousands of samples. Extensive experiments on several representative models such as MobileNet, ResNet, and ViT show that DCP outperforms its counterparts in various settings.
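
The core idea in (i) and (ii), updating a dataflow code by backpropagating through a frozen predictor, can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the predictor is a tiny randomly initialized MLP instead of a network trained on simulator data, codes are assumed to be normalized to $[0, 1]$, HW constraints are reduced to a simple box projection, and the names `make_predictor`, `predict`, `grad_wrt_y`, and `optimize_dataflow` are invented for illustration.

```python
import math
import random

LAYER_DIMS = 7               # length of the DNN-layer code x
DATAFLOW_SHAPE = (3, 7, 2)   # shape of the dataflow code y
Y_SIZE = 3 * 7 * 2           # y flattened to 42 values
HIDDEN = 16                  # hidden width of the toy predictor (assumption)


def make_predictor(seed=0):
    """Toy stand-in for the learned predictor: random, frozen MLP weights."""
    rng = random.Random(seed)
    in_dim = LAYER_DIMS + Y_SIZE
    w1 = [[rng.uniform(-0.3, 0.3) for _ in range(in_dim)] for _ in range(HIDDEN)]
    w2 = [rng.uniform(-0.3, 0.3) for _ in range(HIDDEN)]
    return w1, w2


def predict(params, x, y):
    """Return (predicted metric, hidden activations) for layer code x, dataflow code y."""
    w1, w2 = params
    z = list(x) + list(y)
    h = [math.tanh(sum(w * v for w, v in zip(row, z))) for row in w1]
    return sum(a * b for a, b in zip(w2, h)), h


def grad_wrt_y(params, x, y):
    """Backpropagate the predicted metric to the dataflow code only (x stays fixed)."""
    w1, w2 = params
    _, h = predict(params, x, y)
    # d(metric)/d(z_j) = sum_i w2[i] * (1 - h[i]^2) * w1[i][j]
    return [
        sum(w2[i] * (1.0 - h[i] ** 2) * w1[i][j] for i in range(HIDDEN))
        for j in range(LAYER_DIMS, LAYER_DIMS + Y_SIZE)
    ]


def optimize_dataflow(params, x, y0, steps=200, lr=0.05, lo=0.0, hi=1.0):
    """Projected gradient descent on y; returns the best code seen (including y0)."""
    y = list(y0)
    best_y, best_val = list(y), predict(params, x, y)[0]
    for _ in range(steps):
        g = grad_wrt_y(params, x, y)
        # Gradient step, then clip to [lo, hi] as a placeholder for HW constraints.
        y = [min(hi, max(lo, v - lr * gv)) for v, gv in zip(y, g)]
        val = predict(params, x, y)[0]
        if val < best_val:
            best_y, best_val = list(y), val
    return best_y


if __name__ == "__main__":
    params = make_predictor(seed=0)
    x = [0.5] * LAYER_DIMS   # hypothetical normalized layer code
    y0 = [0.5] * Y_SIZE      # hypothetical initial dataflow code
    y_best = optimize_dataflow(params, x, y0)
    print("before:", predict(params, x, y0)[0])
    print("after: ", predict(params, x, y_best)[0])
```

In this sketch the predictor weights never change during the search; only the dataflow code moves. A real pipeline would additionally map the continuous result back to valid integer dataflow settings, a step omitted here.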
