Performance Analysis of DNN Inference/Training with Convolution and non-Convolution Operations

Hadi Esmaeilzadeh, Soroush Ghodrati, Andrew B. Kahng, Sean Kinzer, Susmita Dey Manasi, Sachin S. Sapatnekar, Zhiang Wang

TL;DR

SimDIT is a framework for end-to-end performance analysis of CNN inference and training on ASIC-based systolic accelerators, covering both convolution (Conv) and non-convolution (non-Conv) operations. It uses a tile-based modeling approach that maps Conv to a GEMM-like workload on a systolic array and handles non-Conv layers on a SIMD array, with data-access and cycle-count models extended to backward gradients and batch normalization (BN) training. Through integration with backend power-performance data, SimDIT enables design-space exploration over on-chip memory and off-chip bandwidth, revealing substantial gains (e.g., up to an 18X inference speedup) and showing that non-Conv layers account for a large share of training cost (e.g., 59.5% of runtime). The work gives actionable guidance for ASIC accelerator design, identifying both optimal resource distributions and economical design points with manageable performance penalties for inference and training.
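To make the "Conv as a GEMM-like workload" idea concrete, here is a minimal sketch. It assumes an im2col-style lowering and a simple weight-tiling count; both are illustrative assumptions, not SimDIT's actual dataflow or cycle model. The function names `conv_as_gemm` and `weight_tiles` are hypothetical.

```python
import math


def conv_as_gemm(n, c_in, h_in, w_in, c_out, k_h, k_w, stride=1, pad=0):
    """Return (M, K, N) of the GEMM produced by an im2col-style lowering.

    Assumption: this is a textbook lowering, not the mapping used in the paper.
    """
    h_out = (h_in + 2 * pad - k_h) // stride + 1
    w_out = (w_in + 2 * pad - k_w) // stride + 1
    m = n * h_out * w_out   # one GEMM row per output pixel (over the batch)
    k = c_in * k_h * k_w    # reduction dimension: one filter's footprint
    n_cols = c_out          # one GEMM column per output channel
    return m, k, n_cols


def weight_tiles(k, n_cols, rows, cols):
    """Count the weight tiles needed to cover a K x N weight matrix
    on a rows x cols processing array (hypothetical helper)."""
    return math.ceil(k / rows) * math.ceil(n_cols / cols)


# A 3x3 Conv with 64 input and 64 output channels on a 56x56 ifmap, batch 1.
m, k, n_cols = conv_as_gemm(1, 64, 56, 56, 64, 3, 3, stride=1, pad=1)
tiles = weight_tiles(k, n_cols, rows=64, cols=64)
print(m, k, n_cols, tiles)
```

Under these assumptions, the layer becomes a 3136 x 576 by 576 x 64 matrix product whose weights are covered by 9 tiles on a 64X64 array. A real model would also have to account for on-chip buffer capacity and off-chip bandwidth, which this sketch ignores.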

Abstract

Today's performance analysis frameworks for deep learning accelerators suffer from two significant limitations. First, although modern convolutional neural networks (CNNs) consist of many types of layers other than convolution, especially during training, these frameworks largely focus on convolution layers only. Second, these frameworks are generally targeted towards inference, and lack support for training operations. This work proposes a novel performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms. The modeling effort of SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training on a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads. SimDIT-enabled performance analysis reveals that on a 64X64 processing array, non-convolution operations constitute 59.5% of total runtime for a ResNet-50 training workload. In addition, by optimally distributing available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves an 18X performance improvement over a generic static resource allocation for ResNet-50 inference.

Paper Structure

This paper contains 20 sections, 34 equations, 12 figures, 10 tables, 1 algorithm.

Figures (12)

  • Figure 1: General overview of SimDIT.
  • Figure 2: Block diagram of the system-level hardware architecture.
  • Figure 3: Convolution layer illustrating filter, ifmap, and ofmap.
  • Figure 4: A tiling template for a convolution layer. The tiling parameters for batch dimension, $T_n$ and $t_n$, are omitted for simplicity.
  • Figure 5: Comparison of SimDIT cycle counts with No-Stall and Simplified cases using four Conv layers of ResNet-50 for both inference (Layer1, Layer2) and training (Layer3, Layer4) phases.
  • ...and 7 more figures