DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Peng Xu, Wenqi Shao, Mingyu Ding, Ping Luo

TL;DR

DCP addresses the onerous task of hand-crafting accelerator dataflows by learning a differentiable optimization pipeline that maps unified encodings of DNN layers and dataflows to hardware metrics. A neural predictor is trained on a large, synthetic benchmark built from a 7D DNN-layer code $x$ and a $(3,7,2)$ dataflow code $y$, with metrics obtained via the MAESTRO-style simulator. After training, DCP updates dataflow codes through gradient-based backpropagation to minimize or trade off latency, energy, and EDP, at layer, model, or multi-objective levels, while enforcing HW constraints. Experiments across MobileNet-V2, ResNet-101, and ViT show DCP outperforms fixed and prior search-based approaches and generalizes to unseen HW settings with zero- or few-shot fine-tuning. The work enables rapid, hardware-aware dataflow customization, with practical impact for deploying efficient DNN accelerators on diverse hardware platforms.

Abstract

Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency. One key reason is the dataflow used to execute a DNN layer, including on-chip data partitioning, computation parallelism, and scheduling policy, which has a large impact on latency and energy consumption. Unlike prior works that required considerable effort from HW engineers to design suitable dataflows for different DNNs, this work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort. It has several attractive benefits that prior work does not have. (i) We translate the HW dataflow configuration into a code representation in a unified dataflow coding space, which can be optimized by backpropagating gradients given a DNN layer or network. (ii) DCP learns a neural predictor to efficiently update the dataflow codes along the desired gradient directions to minimize various optimization objectives, e.g., latency and energy. (iii) It can be easily generalized to unseen HW configurations in a zero-shot or few-shot learning manner. For example, without using additional training data, DCP surpasses the GAMMA method, which performs a full search using thousands of samples. Extensive experiments on several representative models such as MobileNet, ResNet, and ViT show that DCP outperforms its counterparts in various settings.
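The propagation idea in points (i) and (ii) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The tiny random-weight network stands in for a trained predictor, the codes are assumed to be normalized to $[0, 1]$, the two outputs are assumed to be log-latency and log-energy, and the names `predict`, `log_edp_and_grad`, and `propagate` are hypothetical. The predictor weights stay frozen and only the dataflow code is updated, with clipping standing in for HW constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
N_X, N_Y, HID = 7, 3 * 7 * 2, 16  # layer code size, flattened dataflow code size, hidden width

# Frozen toy predictor: random weights stand in for a trained network (assumption).
W1 = rng.normal(scale=0.3, size=(HID, N_X + N_Y))
b1 = np.zeros(HID)
W2 = rng.normal(scale=0.3, size=(2, HID))  # assumed outputs: [log latency, log energy]
b2 = np.zeros(2)

def predict(x, y):
    """Forward pass on the concatenated layer code x and dataflow code y."""
    h = np.tanh(W1 @ np.concatenate([x, y]) + b1)
    return W2 @ h + b2, h

def log_edp_and_grad(x, y):
    """Return log(EDP) and its gradient with respect to the dataflow code only."""
    out, h = predict(x, y)
    loss = out.sum()                   # log(latency * energy) = log latency + log energy
    dh = W2.sum(axis=0)                # d loss / d h
    dz = W1.T @ (dh * (1.0 - h ** 2))  # backpropagate through tanh to the input
    return loss, dz[N_X:]              # keep only the dataflow-code part

def propagate(x, y0, lo=0.0, hi=1.0, steps=50, lr=0.5):
    """Projected gradient descent on y; a step is kept only if the loss decreases."""
    y = np.clip(y0, lo, hi)
    loss, grad = log_edp_and_grad(x, y)
    for _ in range(steps):
        cand = np.clip(y - lr * grad, lo, hi)  # clipping stands in for HW constraints
        cand_loss, cand_grad = log_edp_and_grad(x, cand)
        if cand_loss < loss:
            y, loss, grad = cand, cand_loss, cand_grad
        else:
            lr *= 0.5                          # shrink the step and retry
    return y, loss

x = rng.uniform(size=N_X)    # normalized layer code (illustrative values)
y0 = rng.uniform(size=N_Y)   # normalized initial dataflow code (illustrative values)
init_loss, init_grad = log_edp_and_grad(x, y0)
y_opt, final_loss = propagate(x, y0)
print(final_loss <= init_loss)
```

In a real pipeline the optimized code would still have to be mapped back to valid integer tile sizes and PE counts; that step is omitted here.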

Paper Structure

This paper contains 22 sections, 4 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Dataflow Comparisons and Explanations. (a) We compare the energy-delay product (EDP, lower is better) of NVDLA, Eyeriss, ShiDianNao, and our DCP. DCP achieves the best efficiency on all three visual models. (b) We visualize and compare our dataflow with that of Eyeriss on the first layer of ResNet101. The layer dimensions have been tiled based on the partitioning size of the dataflow. Red marks the tiles computed in parallel, and the orange arrow indicates the computation order, where $K, C, R, S, Y, X, Y', X'$ represent output/input channels, filter row/column, input row/column and output row/column respectively. Our learned dataflow incurs only 18.9% of the reads/writes of Eyeriss by (i) using a smaller kernel tiled size ($4 \times 4$ versus $1 \times 1$ in Eyeriss) and smaller tiled output channels ($8$ versus $4$ in Eyeriss) but a larger one for input ($3 \times 3$ versus $7 \times 7$ in Eyeriss) and (ii) using a different computation order of dimensions.
  • Figure 2: Example of DNN Accelerator. An abstract DNN accelerator architecture that contains a hierarchical memory system and PE arrays. This abstract architecture is also used in many DNN accelerators, such as Eyeriss, TPUv1, SCNN, and SnaPEA.
  • Figure 3: Coding representations of dataflow and DNN layer. The DNN layer code is a seven-dimensional code that describes the layer along the dimensions $K, C, Y, X, R, S, T$. The dataflow code, illustrated with a dataflow optimized for MobileNet-v2, describes the three memory levels of the accelerator. Taking the L1 dataflow code as an example, the first line gives the indices of seven dimensions and the second line gives their accompanying numbers. The seven dimensions comprise the six DNN layer dimensions other than $T$, listed in computation order, plus one parallel dimension selected from those six to perform parallel computation. The accompanying number of the parallel dimension specifies the number of PEs allocated in one cluster of L1 memory, whereas the accompanying number of each of the remaining six dimensions is its partitioning size.
  • Figure 4: An overview of our Dataflow Code Propagation (DCP). DCP first builds a benchmark of dataflow and DNN layer and then trains a predictor to predict various HW metrics. Finally, we back-propagate the gradients of the fixed neural predictor to efficiently update the dataflow code towards the optimization objective (lower predicted metrics), conditioned on the layer code.
  • Figure 5: Regression performance of the neural predictor for the HW metrics of latency, energy, and power, respectively.
  • ...and 6 more figures
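
The coding scheme described in the Figure 3 caption can be sketched as plain arrays. This is an illustrative layout, not the paper's exact one. It assumes the parallel dimension occupies the first of the seven slots, treats $T$ as an opaque integer, and uses made-up tile sizes and PE counts; the helper names `layer_code`, `level_code`, and `dataflow_code` are hypothetical.

```python
import numpy as np

DIMS = ("K", "C", "Y", "X", "R", "S")  # the six layer dimensions other than T

def layer_code(K, C, Y, X, R, S, T):
    """Seven-dimensional layer code; T is kept as an opaque integer here."""
    return np.array([K, C, Y, X, R, S, T], dtype=np.int64)

def level_code(parallel_dim, num_pes, order, tile):
    """One memory level as a (7, 2) array of (dimension index, number) pairs.

    Assumed layout: slot 0 holds the parallel dimension and its PE count,
    slots 1..6 list the six dimensions in computation order with tile sizes.
    """
    if parallel_dim not in DIMS:
        raise ValueError("parallel dimension must be one of the six dimensions")
    if sorted(order) != sorted(DIMS):
        raise ValueError("order must contain each of the six dimensions once")
    pairs = [(DIMS.index(parallel_dim), num_pes)]
    pairs += [(DIMS.index(d), tile[d]) for d in order]
    return np.array(pairs, dtype=np.int64)

def dataflow_code(l1, l2, l3):
    """Stack three memory levels into a (3, 7, 2) dataflow code."""
    return np.stack([l1, l2, l3])

# Illustrative values only; the same level is reused three times for brevity.
x = layer_code(K=64, C=3, Y=224, X=224, R=7, S=7, T=0)
tile = {"K": 4, "C": 3, "Y": 7, "X": 7, "R": 1, "S": 1}
level = level_code("K", 16, ("K", "C", "R", "S", "Y", "X"), tile)
y = dataflow_code(level, level, level)
print(x.shape, y.shape)
```

Keeping the dimension index and its accompanying number side by side means a single array carries the computation order, the tiling, and the parallelism of each memory level.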