Drop-Connect as a Fault-Tolerance Approach for RRAM-based Deep Neural Network Accelerators
Mingyuan Xiang, Xuhan Xie, Pedro Savarese, Xin Yuan, Michael Maire, Yanjing Li
TL;DR
This work tackles the problem of deploying RRAM-based DNN accelerators despite hardware defects, particularly stuck-at-1 (SA1) faults. It proposes a drop-connect-inspired training method that emulates faults during learning, allowing networks to maintain accuracy without hardware changes or extra detection logic. Through systematic simulations on CIFAR-10 with VGG13, MobileNet V2, and ResNet20, the approach tolerates fault rates up to about 10% with minimal degradation in several models. The work also explores design trade-offs for recovering accuracy, such as widening networks and replacing 1x1 shortcuts with 3x3 kernels. The findings highlight practical strategies for fault-tolerant RRAM deployment and the system-level considerations that arise when adapting machine learning techniques to hardware challenges.
Abstract
Resistive random-access memory (RRAM) is widely recognized as a promising emerging hardware platform for deep neural networks (DNNs). Yet, due to manufacturing limitations, current RRAM devices are highly susceptible to hardware defects, which poses a significant challenge to their practical applicability. In this paper, we present a machine learning technique that enables the deployment of defect-prone RRAM accelerators for DNN applications without modifying the hardware, retraining the neural network, or implementing additional detection circuitry/logic. The key idea is to incorporate a drop-connect inspired approach during the training phase of a DNN, where random subsets of weights are selected to emulate fault effects (e.g., set to zero to mimic stuck-at-1 faults), thereby equipping the DNN with the ability to learn and adapt to RRAM defects with the corresponding fault rates. Our results demonstrate the viability of the drop-connect approach, coupled with various algorithm and system-level design and trade-off considerations. We show that, even in the presence of high defect rates (e.g., up to 30%), the degradation of DNN accuracy can be less than 1% compared to that of the fault-free version, while incurring minimal system-level runtime/energy costs.
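The fault-emulation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (sample_fault_mask, apply_fault_mask, faulty_forward), the per-weight independent sampling, and the choice to draw a fresh mask on every forward pass are all assumptions made for illustration, following standard drop-connect practice. The only detail taken from the abstract is that a random subset of weights is set to zero to mimic stuck-at-1 faults at a given fault rate.

```python
import random


def sample_fault_mask(rows, cols, fault_rate, rng):
    """Return a rows x cols mask: 0 marks an emulated faulty cell, 1 a healthy one.

    Assumption: each cell fails independently with probability fault_rate.
    """
    return [[0 if rng.random() < fault_rate else 1 for _ in range(cols)]
            for _ in range(rows)]


def apply_fault_mask(weights, mask):
    """Zero the weights at masked positions (emulating stuck-at-1 faults)."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def faulty_forward(weights, x, fault_rate, rng):
    """Forward pass of one linear layer using a freshly sampled fault mask.

    Assumption: the mask is resampled on every call, as in drop-connect,
    so training sees many different fault patterns at the same fault rate.
    """
    mask = sample_fault_mask(len(weights), len(weights[0]), fault_rate, rng)
    faulty = apply_fault_mask(weights, mask)
    return [sum(w * xi for w, xi in zip(row, x)) for row in faulty]


# Illustrative values only; they are not taken from the paper.
rng = random.Random(0)
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [1.0, 2.0, 3.0]
y = faulty_forward(weights, x, fault_rate=0.3, rng=rng)
```

In an actual training loop, the masked weights would be used in the forward and backward passes while the underlying weights are updated, so that the network learns to perform well under fault patterns drawn at the target fault rate.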
