Sparsity-Aware Optimization of In-Memory Bayesian Binary Neural Network Accelerators

Prabodh Katti, Bashir M. Al-Hashimi, Bipin Rajendran

TL;DR

This work proposes a novel sparsity-aware optimization for Bayesian Binary Neural Network (BBNN) accelerators that exploits the inherent BBNN sampling sparsity -- most of the network is made up of synapses that have a high probability of being fixed at $\pm1$ and require no sampling.

Abstract

Bayesian Neural Networks (BNNs) provide principled estimates of model and data uncertainty by encoding parameters as distributions. This makes them key enablers for reliable AI that can be deployed on safety-critical edge systems. These systems can be made resource efficient by restricting synapses to two synaptic states $\{-1,+1\}$ and using a memristive in-memory computing (IMC) paradigm. However, BNNs pose an additional challenge -- they require multiple instantiations for ensembling, consuming extra resources in terms of energy and area. In this work, we propose a novel sparsity-aware optimization for Bayesian Binary Neural Network (BBNN) accelerators that exploits the inherent BBNN sampling sparsity -- most of the network is made up of synapses that have a high probability of being fixed at $\pm1$ and require no sampling. The optimization scheme proposed here exploits the sampling sparsity that exists both among layers, i.e., only a few layers of the network contain a majority of the probabilistic synapses, and among parameters, i.e., a tiny fraction of parameters in these layers require sampling, reducing the total sampled parameter count further by up to $86\%$. We demonstrate no loss in accuracy or uncertainty quantification performance for a VGGBinaryConnect network on the CIFAR-100 dataset mapped on a custom sparsity-aware phase change memory (PCM) based IMC simulator. We also develop a simple drift compensation technique to demonstrate robustness to drift-induced degradation. Finally, we project latency, energy, and area for a sparsity-aware BNN implementation in both pipelined and non-pipelined modes. With the sparsity-aware implementation, we estimate up to $5.3\times$ reduction in area and $8.8\times$ reduction in energy compared to a non-sparsity-aware implementation. Our approach also results in $2.9\times$ higher power efficiency compared to the state-of-the-art BNN accelerator.
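The sampling sparsity described above can be illustrated with a minimal sketch. It assumes each binary synapse is characterized by a probability of taking the value $+1$, and that a synapse counts as fixed when this probability lies within a small margin of 0 or 1. The margin `eps`, the function names, and the example probabilities are hypothetical choices for illustration and are not taken from the paper.

```python
import random

def split_synapses(probs, eps=0.05):
    """Classify synapses by P(w = +1); eps is a hypothetical threshold."""
    fixed = {}          # index -> fixed weight in {-1, +1}
    probabilistic = []  # indices that still require sampling
    for i, p in enumerate(probs):
        if p >= 1.0 - eps:
            fixed[i] = 1
        elif p <= eps:
            fixed[i] = -1
        else:
            probabilistic.append(i)
    return fixed, probabilistic

def sample_weights(probs, fixed, probabilistic, rng):
    """Draw one ensemble member, sampling only the probabilistic synapses."""
    w = [0] * len(probs)
    for i, value in fixed.items():
        w[i] = value
    for i in probabilistic:
        w[i] = 1 if rng.random() < probs[i] else -1
    return w

# Illustrative probabilities: most synapses are close to 0 or 1.
probs = [0.99, 0.01, 0.5, 0.97, 0.3, 0.02, 1.0, 0.0]
fixed, probabilistic = split_synapses(probs)

rng = random.Random(0)
ensemble = [sample_weights(probs, fixed, probabilistic, rng) for _ in range(10)]

# Fraction of synapses that had to be sampled for each ensemble member.
sampled_fraction = len(probabilistic) / len(probs)
```

In this toy example only two of the eight synapses are sampled per ensemble member; the rest are written once and reused across the whole ensemble.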

Paper Structure

This paper contains 5 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Left: Illustration of BBNN inference, where only a small fraction of synapses are probabilistic and thus participate in sampling. Center: Sparsity-aware optimization to exploit the layer sparsity of VGGBinaryConnect (only synaptic layers shown here). Only layers with a significant concentration of probabilistic synapses are utilized to create an ensemble of predictions, all available simultaneously. Right: Utilizing parameter sampling sparsity within these layers by separating rows with probabilistic synapses into 'Stochastic Plane (SP)', and the rest into 'Deterministic Plane (DP)'. The DP and one of the SP ensembles provide one set of predictions. Thus, predictions are available one by one.
  • Figure 2: Probabilistic synaptic parameter concentration across layers. The bar plot depicts the fraction of the layer that is probabilistic ($n^l_p/n^l$). The line plot shows the layerwise proportion of all probabilistic synapses ($n^l_p/n_p$).
  • Figure 3: The accelerator chip architecture, based on peng2019dnn+.
  • Figure 4: Left: Partitioning the weight matrix into $\boldsymbol{W}^D$ and $\boldsymbol{W}_i^S$. Yellow cells represent probabilistic synapses, while blue cells represent deterministic ones. Right: DP-SP split and MVM operation for submatrix IV. For $f_p=0.33$, the $N_{MC}=10$ submatrices will require $4$ subarrays. Inputs corresponding to non-zero rows are sliced and fed to each $\boldsymbol{W}_{i,sub}^S$ of the SP.
  • Figure 5: Timing analysis for pipelined and non-pipelined mode for both LS and LS+RS modes. $\times 10$ highlights the $10$ ensembles that are processed sequentially. Stage numbers highlighted in red in pipelined mode decide stage latency.
  • ...and 2 more figures
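
The DP-SP row split in the Figure 1 and Figure 4 captions can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes rows index the inputs, that $f_p$ denotes the fraction of rows containing probabilistic synapses, and that the SP subarray count is $\lceil N_{MC} \cdot f_p \rceil$, which matches the caption's numbers ($f_p=0.33$, $N_{MC}=10$, 4 subarrays). All function names, the threshold `eps`, and the example matrix are hypothetical.

```python
import math
import random

def is_probabilistic(p, eps=0.05):
    """A synapse needs sampling if P(w = +1) is not close to 0 or 1."""
    return eps < p < 1.0 - eps

def split_rows(prob_matrix, eps=0.05):
    """Rows with any probabilistic synapse go to the SP, the rest to the DP."""
    dp_rows, sp_rows = [], []
    for r, row in enumerate(prob_matrix):
        if any(is_probabilistic(p, eps) for p in row):
            sp_rows.append(r)
        else:
            dp_rows.append(r)
    return dp_rows, sp_rows

def binarize(p):
    return 1 if p >= 0.5 else -1

def sample_row(row, rng, eps=0.05):
    """Sample probabilistic entries; keep the others fixed."""
    return [(1 if rng.random() < p else -1) if is_probabilistic(p, eps)
            else binarize(p) for p in row]

def partial_mvm(x, weight_rows, n_cols):
    """Accumulate x[r] * w[r][c] over the given {row index: weights} only."""
    out = [0] * n_cols
    for r, w_row in weight_rows.items():
        for c in range(n_cols):
            out[c] += x[r] * w_row[c]
    return out

def sp_subarrays(n_mc, f_p):
    """Assumed SP subarray count: ceil(N_MC * f_p)."""
    return math.ceil(n_mc * f_p)

# P[r][c] = probability that synapse (r, c) equals +1 (illustrative values).
P = [[1.0, 0.0, 1.0],
     [0.0, 0.5, 1.0],
     [0.0, 0.0, 1.0]]
x = [1, -1, 1]
N_MC = 10

dp_rows, sp_rows = split_rows(P)
W_D = {r: [binarize(p) for p in P[r]] for r in dp_rows}
y_D = partial_mvm(x, W_D, 3)  # deterministic part, computed once

rng = random.Random(0)
predictions = []
for _ in range(N_MC):
    W_S = {r: sample_row(P[r], rng) for r in sp_rows}  # one SP instance
    y_S = partial_mvm(x, W_S, 3)  # uses only the inputs of the SP rows
    predictions.append([d + s for d, s in zip(y_D, y_S)])
```

The deterministic partial sum `y_D` is shared by all ensemble members, so only the rows in the SP are re-instantiated and re-evaluated for each of the `N_MC` predictions.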