Drop-Connect as a Fault-Tolerance Approach for RRAM-based Deep Neural Network Accelerators

Mingyuan Xiang, Xuhan Xie, Pedro Savarese, Xin Yuan, Michael Maire, Yanjing Li

TL;DR

This work tackles the problem of deploying RRAM-based DNN accelerators despite hardware defects, particularly stuck-at-1 (SA1) faults. It proposes a drop-connect-inspired training method that mimics faults during learning, allowing networks to maintain accuracy without hardware changes or extra detection logic. In systematic simulations on CIFAR-10 with VGG13, MobileNet V2, and ResNet20, the approach tolerates fault rates up to about $10\%$ with minimal degradation in several models. The work also explores design trade-offs, such as widening networks and replacing 1x1 shortcut kernels with 3x3 kernels, to recover accuracy. The findings highlight practical strategies for fault-tolerant RRAM deployment and the system-level considerations involved in adapting machine learning techniques to hardware challenges.

Abstract

Resistive random-access memory (RRAM) is widely recognized as a promising emerging hardware platform for deep neural networks (DNNs). However, because of manufacturing limitations, current RRAM devices are highly susceptible to hardware defects, which poses a significant challenge to their practical use. In this paper, we present a machine learning technique that enables defect-prone RRAM accelerators to be deployed for DNN applications without modifying the hardware, retraining the neural network, or adding detection circuitry/logic. The key idea is to incorporate a drop-connect-inspired approach into DNN training: random subsets of weights are selected to emulate fault effects (e.g., set to zero to mimic stuck-at-1 faults), so that the DNN learns to adapt to RRAM defects at the corresponding fault rates. Our results demonstrate the viability of the drop-connect approach when coupled with various algorithm- and system-level design and trade-off considerations. We show that, even at high defect rates (e.g., up to 30%), DNN accuracy can degrade by less than 1% relative to the fault-free version, while incurring minimal system-level runtime/energy costs.
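
The fault-emulation idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`sample_fault_mask`, `apply_mask`, `masked_forward`), the single linear layer, and the absence of any weight rescaling are assumptions made for illustration. The only detail taken from the abstract is that a randomly selected weight is set to zero to mimic a stuck-at-1 fault.

```python
import random

def sample_fault_mask(rows, cols, rate, rng):
    """Return a 0/1 mask; each entry is 0 (emulated fault) with probability `rate`."""
    return [[0 if rng.random() < rate else 1 for _ in range(cols)]
            for _ in range(rows)]

def apply_mask(weights, mask):
    """Zero the masked weights, mimicking SA1 faults as described in the abstract."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

def masked_forward(weights, x, mask):
    """Linear layer y = (W * mask) x, computed with the masked weights."""
    faulty = apply_mask(weights, mask)
    return [sum(w * xi for w, xi in zip(row, x)) for row in faulty]

# Hypothetical 2x3 weight matrix and input vector.
weights = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
x = [1.0, 2.0, 3.0]

# Training-time emulation (assumed): draw a fresh mask for each forward pass.
rng = random.Random(0)
train_mask = sample_fault_mask(2, 3, rate=0.3, rng=rng)
y_train = masked_forward(weights, x, train_mask)

# Deployment-time emulation (assumed): one fixed mask stands in for a chip's defects.
chip_mask = sample_fault_mask(2, 3, rate=0.3, rng=random.Random(1))
y_chip = masked_forward(weights, x, chip_mask)
```

In this sketch the training mask is resampled on every call, so the network never sees the same fault pattern twice, while the deployment mask stays fixed. The network is not told which weights are faulty in either case, which matches the abstract's claim that no detection circuitry is needed.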

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Example of an RRAM crossbar.
  • Figure 2: Increasing network width (red) and/or increasing kernel size of 1x1 convolution layers (blue) to compensate for information loss due to drop-connect.
  • Figure 3: Network accuracy of VGG13 for different drop-connect and fault rates.
  • Figure 6: (a) and (b) the combinations of fault rate and drop-connect rate that achieve the highest network accuracy for different network widths; (c) increasing the size of kernels in shortcut layers from 1x1 to 3x3.
  • Figure 7: ResNet20: comparison of fault-free 1x1 convolution layers with the same layers when drop-connect and SA1 faults are applied.