Pre-Defined Sparse Neural Networks with Hardware Acceleration

Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, Keith M. Chugg

TL;DR

The results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss.

Abstract

Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. The first contribution is pre-defined sparsity, which is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that neural networks of various sizes can be supported on Field Programmable Gate Arrays (FPGAs) of various sizes.
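To make the idea of pre-defined sparsity concrete, the following is a minimal sketch, not the authors' implementation. It assumes that pre-defined sparsity means fixing the pattern of connections in a junction before training, so that absent weights are never stored or computed. The helper names (`predefined_sparse_mask`, `density`, `forward`), the constant out-degree per left neuron, and the random choice of connections are all illustrative assumptions; the junction size (800, 100) is borrowed from the $\bm{N}_{\mathrm{net}}$ values in the Figure 1 caption.

```python
import random


def predefined_sparse_mask(n_left, n_right, out_degree, seed=0):
    """Build a fixed 0/1 connection mask with n_right rows and n_left columns.

    Assumption: every left neuron connects to exactly `out_degree` right
    neurons, chosen at random once and then never changed.
    """
    rng = random.Random(seed)
    mask = [[0] * n_left for _ in range(n_right)]
    for left in range(n_left):
        for right in rng.sample(range(n_right), out_degree):
            mask[right][left] = 1
    return mask


def density(mask):
    """Fraction of possible connections that are present in the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def forward(mask, weights, x):
    """Masked weighted sums for one junction (bias and activation omitted)."""
    return [
        sum(m * w * xi for m, w, xi in zip(mask_row, weight_row, x))
        for mask_row, weight_row in zip(mask, weights)
    ]


# Illustrative junction: 800 left neurons, 100 right neurons, out-degree 10.
mask = predefined_sparse_mask(800, 100, 10)
junction_density = density(mask)  # 10 / 100 of the fully connected weights
```

In this sketch only the weights where the mask is 1 would need to be stored and updated, so the storage and multiply-accumulate counts of the junction scale with its density. A dense mask is used here for readability; a real implementation would store just the retained weights.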

Paper Structure

This paper contains 28 sections, 4 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Histograms of weight values in different junctions for FC NNs trained on MNIST for 50 epochs, with (a-b) $\bm{N}_{\mathrm{net}} = (800,100,10)$, and (d-g) $\bm{N}_{\mathrm{net}} = (800,100,100,100,10)$. Test accuracy shown in (c,h) for different NNs with same $\bm{N}_{\mathrm{net}}$ and varying $\rho_{\mathrm{net}}$. The overall density $\rho_{\mathrm{net}}$ is set by reducing $\rho_1$ since junction 1 has more weights close to zero in the FC cases (circled).
  • Figure 2: (a) Processing $z_i = 3$ edges in each cycle (blue in cycle 0, pink in cycle 1) for some junction $i$. (b) Accessing $z_i = 3$ memories -- M0, M1 and M2 shown as columns -- from two separate banks, one in natural order (same address from each memory), the other in interleaved order. Clash-freedom is achieved by accessing only one element from each memory. The accessed values are fed to $z_i = 3$ processors to perform FF simultaneously. (c) Operational parallelism in each junction (vertical dotted lines denote processing for one junction), and junction pipelining of each operation across junctions (horizontal dashed lines) in a multi-junction NN. Subfigure (c) is modified from our previous conference publication Dey2017_ICANN.
  • Figure 3: Architecture for parallel operations for an intermediate junction $i$ ($i \ne 1,L$) showing the three operations along with associated inputs and outputs. Natural and interleaved order accesses are shown using solid and dashed lines, respectively. The $\bm{a}$ and $\dot{\bm{a}}$ memory banks occur as queues, the $\bm{\delta}$ memory banks as pairs, while there is a single weight memory bank. Figure modified from our previous conference publication Dey2017_ICANN.
  • Figure 4: An example of processing inside junction $i$ with $z_i=4$ memories in the weight and left banks, and $z_{i+1}=2$ memories in the right bank. The banks are represented as numerical grids, each column is a memory, and the number in each cell is the number of the edge / left neuron / right neuron whose parameter value is stored in it. Edges are sequentially numbered on the right (shown in curly braces). Four weights are read in each of the six cycles, with the first three cycles colored blue, pink and green, respectively. These represent sweep 0, while the next three cycles (using dashed lines), colored brown, red and purple, respectively, represent sweep 1. Clash-freedom leads to at most one cell from each memory in each bank being accessed each cycle. Weight and right memories are accessed in natural order, while left memories are accessed in interleaved order.
  • Figure 5: Processing the FC version of the junction from Fig. 4. For clarity, only the first 12 and last 12 edges (dashed) are shown, corresponding respectively to right neurons 0 and 7, sweeps 0 and 7, cycles 0--2 and 21--23.
  • ...and 8 more figures
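
The clash-freedom condition described in the captions of Figures 2 and 4 can be sketched as a simple check. This is an illustrative sketch under stated assumptions, not the interleaver used in the paper: it assumes left neuron $n$ is stored in memory $n \bmod z$ at address $\lfloor n / z \rfloor$, that edges are processed $z$ at a time in numbering order, and it uses a hypothetical address rule `(c + m * s) % depth` to generate an interleaved access pattern. The sizes (12 left neurons, $z = 4$, two sweeps, 24 edges in six cycles) mirror the junction described for Figure 4.

```python
def memory_of(neuron, z):
    """Assumed layout: neuron n lives in memory n % z (address n // z)."""
    return neuron % z


def is_clash_free(edge_to_neuron, z):
    """Check that each cycle of z edges touches every memory at most once.

    edge_to_neuron[e] is the left neuron feeding edge e; edges e = k*z ...
    k*z + z - 1 are assumed to be processed together in cycle k.
    """
    for start in range(0, len(edge_to_neuron), z):
        cycle = edge_to_neuron[start:start + z]
        memories = {memory_of(n, z) for n in cycle}
        if len(memories) != len(cycle):
            return False
    return True


def interleaved_pattern(n_left, z, sweeps):
    """Hypothetical interleaver: in sweep s, cycle c, memory m is read at
    address (c + m * s) % depth, so each memory is read exactly once per cycle
    and every left neuron is read exactly once per sweep."""
    depth = n_left // z
    pattern = []
    for s in range(sweeps):
        for c in range(depth):
            for m in range(z):
                address = (c + m * s) % depth
                pattern.append(address * z + m)
    return pattern


pattern = interleaved_pattern(n_left=12, z=4, sweeps=2)
clash_free = is_clash_free(pattern, z=4)

# Counterexample: neurons 0, 4 and 8 all sit in memory 0, so reading them in
# the same cycle would require three accesses to one memory.
clashing = is_clash_free([0, 4, 8, 1], z=4)
```

With this rule, sweep 0 reads the left memories at the same address in every memory, while sweep 1 reads a different address from each memory. Both satisfy the one-access-per-memory condition, which is what lets $z$ processors operate in parallel without memory contention.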